Number of Partitions of a Spark DataFrame

Published 2019-03-29 12:02

Question:

Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD we can specify the number of partitions at creation time, like below.

val RDD1 = sc.textFile("path", 6)

But when creating a Spark DataFrame, there does not seem to be an option to specify the number of partitions the way there is for an RDD.

The only possibility I can think of is to call the repartition API after the DataFrame has been created:

df.repartition(4)
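For reference, the result can be checked with getNumPartitions (a quick sketch, assuming df already exists):

df.rdd.getNumPartitions                 // partitions before repartitioning
df.repartition(4).rdd.getNumPartitions  // 4 after repartitioning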

So can anyone please let me know whether the number of partitions can be specified while creating a DataFrame?

Answer 1:

You cannot, or at least not in the general case, but it is not that different from RDDs. For example, the textFile code you've provided sets only a lower bound on the number of partitions.
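A minimal sketch of that behaviour, assuming "path" points to a splittable text file:

// The second argument is minPartitions: a lower bound, not an exact count.
val rdd = sc.textFile("path", 6)
rdd.getNumPartitions  // typically >= 6; the exact count depends on the input splits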

In general (a combined sketch follows this list):

  • Datasets generated locally, using methods like range or toDF on a local collection, use spark.default.parallelism.
  • Datasets created from an RDD inherit the number of partitions from their parent.
  • Datasets created using the data source API:

    • In Spark 1.x the number typically depends on the Hadoop configuration (min/max split size).
    • In Spark 2.x a Spark SQL-specific configuration (e.g. spark.sql.files.maxPartitionBytes) is used instead.
  • Some data sources may provide additional options that give more control over partitioning. For example, the JDBC source allows you to set a partitioning column, a range of values, and the desired number of partitions.
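To make the list above concrete, here is a minimal sketch (it assumes a spark-shell style environment where spark and sc are already defined; the JDBC connection details are hypothetical placeholders):

import spark.implicits._

// Locally generated Datasets use spark.default.parallelism:
spark.range(0, 1000).rdd.getNumPartitions   // spark.default.parallelism
(1 to 100).toDF("x").rdd.getNumPartitions   // typically spark.default.parallelism

// Datasets created from an RDD inherit the parent's partitioning:
val sixPartRdd = sc.parallelize(1 to 100, 6)
sixPartRdd.toDF("x").rdd.getNumPartitions   // 6, inherited from the RDD

// The JDBC source accepts a partitioning column, a value range,
// and the desired number of partitions:
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db")  // placeholder URL
  .option("dbtable", "some_table")             // placeholder table
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "8")
  .load()
jdbcDF.rdd.getNumPartitions                    // 8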