Number of Partitions of a Spark DataFrame

Published 2019-03-29 12:02

Question:

Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD we can specify the number of partitions at creation time, like below.

val RDD1 = sc.textFile("path", 6)

But when creating a Spark DataFrame, there does not seem to be an option to specify the number of partitions the way there is for an RDD.

The only possibility I can think of is to call the repartition API after the DataFrame has been created:

df.repartition(4)
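For reference, the result can be checked with getNumPartitions (a quick sketch, assuming df already exists):

df.rdd.getNumPartitions                 // partitions before repartitioning
df.repartition(4).rdd.getNumPartitions  // 4 after repartitioning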

So can anyone please let me know whether the number of partitions can be specified while creating a DataFrame?

Answer 1:

You cannot, or at least not in the general case, but it is not that different from RDDs. For example, the textFile code you've provided sets only a lower bound on the number of partitions.
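A minimal sketch of that behaviour, assuming "path" points to a splittable text file:

// The second argument is minPartitions: a lower bound, not an exact count.
val rdd = sc.textFile("path", 6)
rdd.getNumPartitions  // typically >= 6; the exact count depends on the input splits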

In general (a combined sketch follows this list):

  • Datasets generated locally, using methods like range or toDF on a local collection, use spark.default.parallelism.
  • Datasets created from an RDD inherit the number of partitions from their parent.
  • Datasets created using the data source API:

    • In Spark 1.x the number typically depends on the Hadoop configuration (min/max split size).
    • In Spark 2.x a Spark SQL-specific configuration (e.g. spark.sql.files.maxPartitionBytes) is used instead.
  • Some data sources may provide additional options that give more control over partitioning. For example, the JDBC source allows you to set a partitioning column, a range of values, and the desired number of partitions.
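To make the list above concrete, here is a minimal sketch (it assumes a spark-shell style environment where spark and sc are already defined; the JDBC connection details are hypothetical placeholders):

import spark.implicits._

// Locally generated Datasets use spark.default.parallelism:
spark.range(0, 1000).rdd.getNumPartitions   // spark.default.parallelism
(1 to 100).toDF("x").rdd.getNumPartitions   // typically spark.default.parallelism

// Datasets created from an RDD inherit the parent's partitioning:
val sixPartRdd = sc.parallelize(1 to 100, 6)
sixPartRdd.toDF("x").rdd.getNumPartitions   // 6, inherited from the RDD

// The JDBC source accepts a partitioning column, a value range,
// and the desired number of partitions:
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db")  // placeholder URL
  .option("dbtable", "some_table")             // placeholder table
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "8")
  .load()
jdbcDF.rdd.getNumPartitions                    // 8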