How do we calculate the input data size and feed the number of partitions to the repartition method?

Posted 2019-04-15 23:05

Question:

Example: suppose we have an input RDD, InputRDD, which is filtered in a second step. I want to calculate the size of the data in the filtered RDD and, assuming a block size of 128 MB, work out how many partitions repartition should produce.

This would let me pass the right number of partitions to the repartition method.

InputRDD = sc.textFile("sample.txt")
FilteredRDD = InputRDD.filter(<some filter condition>)  # placeholder condition from the question
FilteredRDD.repartition(XX)

Q1. How do we calculate the value of XX?

Q2. What is the equivalent approach for Spark SQL / DataFrames?

Answer 1:

The 128 MB block size comes into play only when reading data from or writing data to HDFS. Once an RDD is created, its data lives in memory or spills to disk, depending on executor RAM.
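To make that concrete, here is a minimal PySpark sketch (my addition, not part of the original answer) that reads the input file's on-disk size and derives a partition count from the 128 MB block size. It reaches the Hadoop FileSystem API through PySpark's private _jsc/_jvm gateway, a common but unofficial pattern; the path "sample.txt" is taken from the question.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Reach the Hadoop FileSystem API through the JVM gateway (unofficial but widely used).
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# On-disk size of the input, before any filtering.
size_bytes = fs.getContentSummary(Path("sample.txt")).getLength()

# Ceiling division: how many 128 MB blocks are needed to hold the input.
block_size = 128 * 1024 * 1024
num_partitions = max(1, -(-size_bytes // block_size))

Note this measures the raw input, not the filtered data; as the answer says, the post-filter size is not knowable before the filter actually runs.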

You can't know the size of the filtered data without triggering an action such as collect() on the filtered RDD, and collecting it to the driver is not recommended.
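If an estimate is acceptable, one workaround (my own sketch, not something the answer proposes) is to measure a small sample and extrapolate. The helper name estimate_rdd_bytes is hypothetical, and the pickled size only approximates the serialized size, not the exact in-memory footprint; count() and take() still trigger Spark jobs, so this is cheaper than collect() but not free.

import pickle

def estimate_rdd_bytes(rdd, sample_size=1000):
    # Extrapolate the total size from the pickled size of a small sample.
    total = rdd.count()
    if total == 0:
        return 0
    sample = rdd.take(min(sample_size, total))
    sampled_bytes = sum(len(pickle.dumps(row)) for row in sample)
    return int(sampled_bytes / len(sample) * total)

# Derive XX from the estimate and a 128 MB target partition size:
XX = max(1, -(-estimate_rdd_bytes(FilteredRDD) // (128 * 1024 * 1024)))
FilteredRDD.repartition(XX)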

The maximum size of a single partition is 2 GB; choose the number of partitions based on your cluster size and data model.

For Spark SQL / DataFrames, use the DataFrame's repartition method, e.g.:

df.repartition(col)
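Expanding on Q2 with a short sketch (my addition; the example values are assumptions): DataFrames can be repartitioned by an explicit count or by column, and Spark SQL's shuffle parallelism is controlled by the configuration key spark.sql.shuffle.partitions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("sample.txt")  # yields a single column named "value"

df_by_count = df.repartition(8)        # explicit partition count (8 is an example value)
df_by_col = df.repartition("value")    # hash-partition by column
spark.conf.set("spark.sql.shuffle.partitions", "200")  # shuffle parallelism (200 is the default)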