In Spark, sc.newAPIHadoopRDD uses only 5 tasks to read 2.7 GB of data

Posted 2019-06-04 00:34

乱世女痞

Question:

I am using Spark 1.4 and reading 2.7 GB of data from HBase with sc.newAPIHadoopRDD. Only 5 tasks are created for this stage, and it takes 2 to 3 minutes to process. Can anyone tell me how to increase the number of partitions so the data is read faster?

Answer 1:

org.apache.hadoop.hbase.mapreduce.TableInputFormat creates one partition per region, so your table appears to be split into 5 regions. Pre-splitting the table into more regions should increase the number of partitions (see the HBase documentation on region splitting for more information).
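
As a rough illustration of pre-splitting, the sketch below computes evenly spaced split keys and prints an HBase shell create command that uses them. It rests on assumptions that are not in the question: row keys are taken to be fixed-width lowercase hex strings spread uniformly over the key space (for example a hashed prefix), and the table name my_table and column family cf are placeholders. If the real row keys are distributed differently, the split points would have to be chosen from the actual key distribution instead.

```python
def hex_split_keys(num_regions, key_width=4):
    """Return num_regions - 1 evenly spaced split keys.

    Assumes (hypothetically) that row keys start with a fixed-width,
    lowercase hex prefix that is uniformly distributed.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    key_space = 16 ** key_width
    if num_regions > key_space:
        raise ValueError("more regions requested than distinct key prefixes")
    return [
        format(i * key_space // num_regions, "0{}x".format(key_width))
        for i in range(1, num_regions)
    ]


def hbase_shell_create(table, family, split_keys):
    """Build an HBase shell command that creates a pre-split table."""
    quoted = ", ".join("'{}'".format(key) for key in split_keys)
    return "create '{}', '{}', SPLITS => [{}]".format(table, family, quoted)


if __name__ == "__main__":
    # 'my_table' and 'cf' are placeholder names, not taken from the question.
    keys = hex_split_keys(16)
    print(hbase_shell_create("my_table", "cf", keys))
```

A table created with 15 split keys starts with 16 regions, so by the reasoning above TableInputFormat would be expected to produce 16 partitions rather than 5. This only helps if the data is actually spread across those regions, and an existing table would need to be split or reloaded into the new layout.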