I always use `reduceByKey` when I need to group data in RDDs, because it performs a map-side reduce before shuffling data, which often means that less data gets shuffled around and I thus get better performance. Even when the map-side reduce function collects all values and does not actually reduce the data amount, I still use `reduceByKey`, because I'm assuming that the performance of `reduceByKey` will never be worse than that of `groupByKey`. However, I'm wondering if this assumption is correct, or if there are indeed situations where `groupByKey` should be preferred?
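For instance, a minimal sketch of the kind of "reduce" I mean (assuming a spark-shell with an existing `sc`; the sample data and names are just illustrative):

```scala
val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))

// The "reduce" function only collects values into a list, so the
// map-side combine cannot actually shrink the data before the shuffle.
val viaReduceByKey = pairs.mapValues(List(_)).reduceByKey(_ ::: _)

// The same grouping expressed directly.
val viaGroupByKey = pairs.groupByKey().mapValues(_.toList)
```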
I won't reinvent the wheel; according to the code documentation, the `groupByKey` operation groups the values for each key in the RDD into a single sequence, and also allows controlling the partitioning of the resulting key-value pair RDD by passing a `Partitioner`.

This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using `aggregateByKey` or `reduceByKey` will provide much better performance.

Note: as currently implemented, `groupByKey` must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OOME.

As a matter of fact, I prefer the `combineByKey` operation, but it's sometimes hard to understand the concepts of the combiner and the merger if you are not very familiar with the map-reduce paradigm. For this, you can read the Yahoo map-reduce bible here, which explains this topic well. For more information, I advise you to read the `PairRDDFunctions` code.
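For instance, here is a small sketch of the combiner/merger idea, computing a per-key average (the sample data and names are just illustrative, assuming a spark-shell with an existing `sc`):

```scala
val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 4.0)))

// createCombiner: turn the first value seen for a key into (sum, count)
// mergeValue:     fold another value into the running (sum, count)
// mergeCombiners: merge partial (sum, count) pairs from different partitions
val avgByKey = scores.combineByKey(
  (v: Double) => (v, 1L),
  (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),
  (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)
).mapValues { case (sum, count) => sum / count }
```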
I believe there are other aspects of the problem ignored by climbage and eliasah:
If the operation doesn't reduce the amount of data, it has to be, one way or another, semantically equivalent to `groupByKey`. Let's assume we have an `RDD[(Int, String)]`:
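For example, with some assumed sample data:

```scala
val rdd = sc.parallelize(Seq((1, "foo"), (1, "bar"), (2, "foobar")))
```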
and we want to concatenate all strings for a given key. With `groupByKey` it is pretty simple:
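A sketch of that, using the example `rdd` above:

```scala
val concatenated = rdd.groupByKey.mapValues(_.mkString(""))
```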
A naive solution with `reduceByKey` looks like this:
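For instance (again a sketch with the same `rdd`):

```scala
val concatenatedNaive = rdd.reduceByKey(_ + _)
```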
It is short and arguably easy to understand but suffers from two issues:

- it is extremely inefficient, because it creates a new `String` object every time*
- it suggests that the operation you perform is less expensive than it really is

To deal with the first problem we need a mutable data structure:
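One possible shape for this, using `StringBuilder` as the mutable accumulator (a sketch, not the only option):

```scala
import scala.collection.mutable.StringBuilder

val concatenatedMutable = rdd.combineByKey[StringBuilder](
  (s: String) => new StringBuilder(s),                         // createCombiner
  (sb: StringBuilder, s: String) => sb ++= s,                  // mergeValue
  (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)  // mergeCombiners
).mapValues(_.toString)
```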
It still suggests that something else is really going on, and it is quite verbose, especially if repeated multiple times in your script. You can of course extract the anonymous functions.
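For example, something along these lines (just a sketch):

```scala
val createStringCombiner = (s: String) => new StringBuilder(s)
val mergeStringValue = (sb: StringBuilder, s: String) => sb ++= s
val mergeStringCombiners =
  (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)

val concatenatedCombined = rdd
  .combineByKey(createStringCombiner, mergeStringValue, mergeStringCombiners)
  .mapValues(_.toString)
```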
But at the end of the day it still means additional effort to understand this code, increased complexity, and no real added value. One thing I find particularly troubling is the explicit inclusion of mutable data structures. Even if Spark handles almost all of the complexity, it means we no longer have elegant, referentially transparent code.
My point is: if you really reduce the amount of data, by all means use `reduceByKey`. Otherwise you make your code harder to write, harder to analyze, and gain nothing in return.

Note: This answer is focused on the Scala `RDD` API. The current Python implementation is quite different from its JVM counterpart and includes optimizations which provide a significant advantage over a naive `reduceByKey` implementation in the case of `groupBy`-like operations. For the `Dataset` API, see DataFrame / Dataset groupBy behaviour/optimization.

* See Spark performance for Scala vs Python for a convincing example.
`reduceByKey` and `groupByKey` both use `combineByKey` with different combine/merge semantics.

The key difference I see is that `groupByKey` passes the flag `mapSideCombine=false` to the shuffle engine. Judging by the issue SPARK-772, this is a hint to the shuffle engine not to run the map-side combiner when the data size isn't going to change.

So I would say that if you are trying to use `reduceByKey` to replicate `groupByKey`, you might see a slight performance hit.
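For concreteness, a rough user-level sketch of that flag (the names are illustrative, a spark-shell with an existing `sc` is assumed, and Spark's real `groupByKey` uses an internal `CompactBuffer` rather than a `List`):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))

// Roughly what groupByKey does under the hood: combineByKey with
// mapSideCombine = false, so no map-side combining is attempted at all.
val grouped = pairs.combineByKey[List[String]](
  (v: String) => List(v),
  (acc: List[String], v: String) => v :: acc,
  (a: List[String], b: List[String]) => a ::: b,
  new HashPartitioner(pairs.getNumPartitions),
  mapSideCombine = false
)
```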