Number of partitions of a Spark DataFrame created from a Hive table

Posted 2019-05-23 11:13

I have a question about the number of partitions of a Spark DataFrame.

Say I have a Hive table (employee) with the columns (name, age, id, location):

CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);

The employee table has 10 different locations, so the data is split into 10 partitions on HDFS.

If I create a Spark DataFrame (df) by reading all of the data in the Hive table (employee), how many partitions will Spark create for the DataFrame (df)?

df.rdd.partitions.size = ??
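For reference, here is a minimal sketch of how to check this, assuming a Hive-enabled SparkSession and the employee table above (the app name and object name are arbitrary):

import org.apache.spark.sql.SparkSession

object PartitionCount {
  def main(args: Array[String]): Unit = {
    // Hive support is needed so that spark.table can see the metastore table.
    val spark = SparkSession.builder()
      .appName("partition-count")
      .enableHiveSupport()
      .getOrCreate()

    // Read the whole Hive table into a DataFrame.
    val df = spark.table("employee")

    // Number of partitions of the underlying RDD.
    println(df.rdd.partitions.size)   // equivalently: df.rdd.getNumPartitions
  }
}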

1 Answer
SAY GOODBYE
Answered 2019-05-23 11:56

The partitions are created based on the HDFS block size.

Imagine you have read the 10 Hive partitions as a single RDD. If the HDFS block size is 128 MB, then roughly

number of Spark partitions = (total size of the 10 Hive partitions in MB) / 128 MB

since the data is stored on HDFS in blocks of that size and each block becomes one input split.
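As a rough worked example of that rule of thumb (the 1280 MB total size is purely hypothetical):

// Rule of thumb from above, with a made-up total size for the 10 Hive partitions.
val totalSizeMB = 1280.0   // hypothetical combined size of the table's files
val blockSizeMB = 128.0    // default HDFS block size assumed in this answer
val expectedPartitions = math.ceil(totalSizeMB / blockSizeMB).toInt
println(expectedPartitions)   // 1280 / 128 = 10 blocks -> roughly 10 Spark partitions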

Please refer to the following link:

http://www.bigsynapse.com/spark-input-output
