I'm using Hadoop with Spark SQL. Normally, pulling 1000 columns takes 1-2 seconds, but once I go past 2000 columns Spark freezes and takes several minutes to load the data.
spark-submit --deploy-mode client --master yarn --num-executors=40 --executor-cores=2 --executor-memory=5G /home/hdoop/SparkApplication/demo-0.0.1-SNAPSHOT.jar
@Configuration
@EnableAutoConfiguration(exclude = {org.springframework.boot.autoconfigure.gson.GsonAutoConfiguration.class})
public class SparkConfig {

    @Value("Test Spark Application")
    private String appName;

    @Value("local[*]")
    private String masterUri;

    @Bean
    public SparkConf sparkConf() {
        return new SparkConf()
                .setAppName(appName)
                .setMaster(masterUri)
                .set("spark.sql.debug.maxToStringFields", "1000")
                .setJars(new String[] {
                        "/home/hdoop/SparkApplication/demo-0.0.1-SNAPSHOT.jar",
                        "/home/hdoop/SparkApplication/spark-core_2.12-3.0.1.jar",
                        "/home/hdoop/SparkApplication/postgresql-42.2.10.jar"});
    }
}
@Service
@EnableAutoConfiguration(exclude = {org.springframework.boot.autoconfigure.gson.GsonAutoConfiguration.class})
public class StackOverFlow implements Serializable {

    @Autowired
    private SparkConf sparkConf;

    public List<Order> getObject(String param, String value, Long limit) {
        Encoder<Order> orderEncoder = Encoders.bean(Order.class);
        SparkSession session = SparkSession
                .builder()
                .config(sparkConf)
                .getOrCreate();
        if (!session.sqlContext().isTraceEnabled()) {
            SparkSession.setActiveSession(session);
        }
        Dataset<Row> jdbcDF = session.read()
                .format("jdbc")
                // the PostgreSQL JDBC URL format is jdbc:postgresql://host:port/database
                .option("url", "jdbc:postgresql://postgres:5432/orders")
                .option("driver", "org.postgresql.Driver")
                // note: concatenating param/value directly into SQL is vulnerable to SQL injection
                .option("query", "select * from orders o where " + param + " = '" + value + "' limit " + limit)
                .option("user", "root")
                .option("password", "password")
                .load();
        List<Order> orders = jdbcDF.as(orderEncoder).collectAsList();
        session.stop();
        return orders;
    }
}
This needs to be looked at from two angles:

Tuning fetchsize and the other parameters described in the documentation. For example, if fetchsize defaults to 50, fetching 1000 rows takes 20 round trips to the database; raising it to 1000 brings those rows back in a single round trip.
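As a sketch of what that tuning looks like (reusing the PostgreSQL source from the question; host, table, and credentials are placeholders, and a live database is required to actually run it), `fetchsize` is just an extra `.option(...)` on the JDBC read:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FetchSizeExample {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("FetchSize Demo")
                .master("local[*]")
                .getOrCreate();

        // fetchsize controls how many rows the JDBC driver pulls per round trip;
        // a larger value means fewer round trips for large result sets, at the
        // cost of more memory per fetch.
        Dataset<Row> jdbcDF = session.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://postgres:5432/orders") // placeholder host/db
                .option("driver", "org.postgresql.Driver")
                .option("dbtable", "orders")
                .option("user", "root")          // placeholder credentials
                .option("password", "password")
                .option("fetchsize", "1000")     // tune per workload; the default is driver-dependent
                .load();

        jdbcDF.show();
        session.stop();
    }
}
```

The right value depends on row width: with 2000+ columns each row is large, so the sweet spot may be lower than for narrow tables.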