How do we calculate the input data size and feed the number of partitions to the repartition method?

Posted 2019-04-15 23:05

Question:

Example: suppose we have an input RDD, InputRDD, which is filtered in a second step. I want to calculate the size of the data in the filtered RDD and, assuming a block size of 128 MB, work out how many partitions repartition should produce.

This would let me pass the right number of partitions to the repartition method.

InputRDD = sc.textFile("sample.txt")
FilteredRDD = InputRDD.filter(<some filter condition>)  # placeholder condition from the question
FilteredRDD.repartition(XX)

Q1. How do we calculate the value of XX?

Q2. What is the equivalent approach for Spark SQL / DataFrames?

Answer 1:

The 128 MB block size comes into play only when reading data from or writing data to HDFS. Once an RDD is created, its data lives in memory or spills to disk, depending on executor RAM.
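To make that concrete, here is a minimal PySpark sketch (my addition, not part of the original answer) that reads the input file's on-disk size and derives a partition count from the 128 MB block size. It reaches the Hadoop FileSystem API through PySpark's private _jsc/_jvm gateway, a common but unofficial pattern; the path "sample.txt" is taken from the question.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Reach the Hadoop FileSystem API through the JVM gateway (unofficial but widely used).
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# On-disk size of the input, before any filtering.
size_bytes = fs.getContentSummary(Path("sample.txt")).getLength()

# Ceiling division: how many 128 MB blocks are needed to hold the input.
block_size = 128 * 1024 * 1024
num_partitions = max(1, -(-size_bytes // block_size))

Note this measures the raw input, not the filtered data; as the answer says, the post-filter size is not knowable before the filter actually runs.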

You can't know the size of the filtered data without triggering an action such as collect() on the filtered RDD, and collecting it to the driver is not recommended.
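If an estimate is acceptable, one workaround (my own sketch, not something the answer proposes) is to measure a small sample and extrapolate. The helper name estimate_rdd_bytes is hypothetical, and the pickled size only approximates the serialized size, not the exact in-memory footprint; count() and take() still trigger Spark jobs, so this is cheaper than collect() but not free.

import pickle

def estimate_rdd_bytes(rdd, sample_size=1000):
    # Extrapolate the total size from the pickled size of a small sample.
    total = rdd.count()
    if total == 0:
        return 0
    sample = rdd.take(min(sample_size, total))
    sampled_bytes = sum(len(pickle.dumps(row)) for row in sample)
    return int(sampled_bytes / len(sample) * total)

# Derive XX from the estimate and a 128 MB target partition size:
XX = max(1, -(-estimate_rdd_bytes(FilteredRDD) // (128 * 1024 * 1024)))
FilteredRDD.repartition(XX)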

The maximum size of a single partition is 2 GB; choose the number of partitions based on your cluster size and data model.

For Spark SQL / DataFrames, use the DataFrame's repartition method, e.g.:

df.repartition(col)
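Expanding on Q2 with a short sketch (my addition; the example values are assumptions): DataFrames can be repartitioned by an explicit count or by column, and Spark SQL's shuffle parallelism is controlled by the configuration key spark.sql.shuffle.partitions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("sample.txt")  # yields a single column named "value"

df_by_count = df.repartition(8)        # explicit partition count (8 is an example value)
df_by_col = df.repartition("value")    # hash-partition by column
spark.conf.set("spark.sql.shuffle.partitions", "200")  # shuffle parallelism (200 is the default)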