GroupBy operation on a DataFrame takes a lot of time

Posted 2019-04-11 13:44

In one of my Spark jobs (Spark 2.0 on EMR 5.0.0) I had about 5 GB of data that was cross joined with 30 rows (a few MBs of data), and I then needed to group the result. I noticed that this was taking a lot of time: approximately 4 hours with one m3.xlarge master and six m3.2xlarge core nodes. Of the total time, 2 hours were spent on processing and another 2 hours on writing the data to S3. That performance did not impress me. I searched the net and found this link, which says that groupBy leads to a lot of shuffling. It also suggests using reduceByKey to avoid heavy shuffling, because with reduceByKey the data is combined per partition, so each partition outputs at most one value per key to send over the network, whereas with groupByKey all the data is wastefully sent over the network and collected on the reduce workers. However, there is no direct reduceByKey API on a Spark DataFrame; you need to convert the DataFrame to an RDD and then call reduceByKey. So my questions are:

1. Has anyone faced a similar issue, and what actions did you take to improve performance?
2. Is my choice of machines wrong?
3. Does groupBy in Spark 2.0 already perform a reduceByKey-like optimization, so that reduceByKey is not needed with the DataFrame API?

Here is the groupBy code:

val aggregtedDF: DataFrame = joinedDFWithOtherDfs.groupBy("Col1", "Col2").agg(min("Col3").alias("Col3"))
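For reference, here is a minimal sketch of the RDD-based reduceByKey alternative described above. It assumes joinedDFWithOtherDfs has string columns Col1 and Col2 and a numeric (double) Col3, and that an implicit SparkSession named spark is in scope; adjust the getters to the real column types.

    // Sketch only: convert the DataFrame to an RDD of (key, value) pairs and
    // use reduceByKey so partial aggregation happens on the map side,
    // sending at most one value per key per partition over the network.
    import org.apache.spark.sql.DataFrame

    val reducedRDD = joinedDFWithOtherDfs
      .select("Col1", "Col2", "Col3")
      .rdd
      .map(row => ((row.getString(0), row.getString(1)), row.getDouble(2)))
      .reduceByKey((a, b) => math.min(a, b))

    // Back to a DataFrame with the same shape as the groupBy version.
    import spark.implicits._
    val aggregatedViaReduce: DataFrame = reducedRDD
      .map { case ((c1, c2), c3) => (c1, c2, c3) }
      .toDF("Col1", "Col2", "Col3")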
