I have a question about the number of partitions of a Spark DataFrame.
Suppose I have a Hive table (employee) with the columns (name, age, id, location), created as:

    CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);
If the employee table has 10 different locations, the data will be partitioned into 10 partition directories in HDFS.
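For illustration, HDFS would then contain one subdirectory per location value; the warehouse path and location names below are hypothetical:

    /user/hive/warehouse/employee/location=NY/
    /user/hive/warehouse/employee/location=SF/
    ...
    /user/hive/warehouse/employee/location=TX/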
If I create a Spark DataFrame (df) by reading the entire Hive table (employee), how many partitions will Spark create for the DataFrame (df)?
    df.rdd.partitions.size = ??
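For reference, this is roughly how I read the table and check; a minimal sketch, assuming a Hive-enabled SparkSession (the app name is arbitrary):

    import org.apache.spark.sql.SparkSession

    // Assumes Spark is configured with Hive support
    val spark = SparkSession.builder()
      .appName("PartitionCount")      // arbitrary app name
      .enableHiveSupport()
      .getOrCreate()

    // Read the whole Hive table into a DataFrame
    val df = spark.table("employee")

    // Inspect how many partitions the underlying RDD has
    println(df.rdd.partitions.size)   // same value as df.rdd.getNumPartitions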
The number of partitions is determined by the HDFS block size, not by the number of Hive partitions. Imagine the 10 Hive partitions are read as a single RDD; if the HDFS block size is 128 MB, then roughly

    number of partitions = (total size of the 10 Hive partitions in MB) / 128 MB

i.e., approximately one Spark partition per HDFS block of the underlying files.
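As a quick worked example with made-up numbers (the 1280 MB total is purely illustrative):

    // Back-of-the-envelope estimate; the sizes here are assumptions
    val totalSizeMB   = 1280.0  // hypothetical total size of the 10 Hive partitions
    val blockSizeMB   = 128.0   // default HDFS block size
    val estPartitions = math.ceil(totalSizeMB / blockSizeMB).toInt

    println(estPartitions)      // 10 -- about one partition per HDFS block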
Please refer to the following link:
http://www.bigsynapse.com/spark-input-output