In one of my Spark jobs (Spark 2.0 on EMR 5.0.0), I had about 5 GB of data that was cross joined with 30 rows (a few MB of data), and I then needed to group by the result. I noticed it was taking a lot of time: approximately 4 hours with one m3.xlarge master and six m3.2xlarge core nodes, of which roughly 2 hours were spent on processing and another 2 hours on writing the data to S3. That did not seem very impressive to me.

I searched around and found this link, which says that groupBy leads to a lot of shuffling. It suggests using reduceByKey to avoid the shuffling, because with reduceByKey the data is combined locally, so each partition outputs at most one value per key to send over the network, whereas with groupByKey all of the data is wastefully sent over the network and collected on the reduce workers. However, there is no direct reduceByKey API on a Spark DataFrame; you need to convert the DataFrame to an RDD and then call reduceByKey (a rough sketch of what I mean is below, after my groupBy code).

So my questions are:

1. Has anyone faced a similar issue, and what actions did you take to improve performance?
2. Is my choice of machines wrong?
3. Does groupBy in Spark 2.0 already apply a reduceByKey-like optimization, so that reduceByKey is not needed with the DataFrame API?
Here is the groupBy code -
val aggregtedDF: DataFrame = joinedDFWithOtherDfs.groupBy("Col1", "Col2").agg(min("Col3").alias("Col3"))
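And for reference, this is roughly what I understand the RDD-based reduceByKey alternative would look like. It is only a sketch: I am assuming Col1/Col2 are strings, Col3 is an Int, and that a SparkSession named spark is in scope; the row getters would need to match the real schema.

import org.apache.spark.sql.DataFrame
import spark.implicits._  // assumption: SparkSession available as `spark`

val reducedDF: DataFrame = joinedDFWithOtherDfs
  .select("Col1", "Col2", "Col3")
  .rdd
  .map(row => ((row.getString(0), row.getString(1)), row.getInt(2))) // key by (Col1, Col2)
  .reduceByKey(math.min(_, _))                                       // combine per partition before the shuffle
  .map { case ((c1, c2), minC3) => (c1, c2, minC3) }
  .toDF("Col1", "Col2", "Col3")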