How to understand reduceByKey in Spark?

2019-04-17 15:48发布

I am trying to learn Spark and it has been going well so far, except for problems where I need to use functions like reduceByKey or combineByKey on a pair RDD whose values are lists.

I have been trying to find detailed documentation for these functions, that could explain what the arguments actually are, so that I could solve it myself without going to Stack Overflow, but I just cannot find any good documentation for Spark. I have read chapters 3 and 4 from Learning Spark, but to be honest, the explanations for the most complicated functions are very bad.

The problem I am dealing with right now is the following: I have a pair RDD where the key is a string and the value is a list of two elements which are both integers. Something like this: (country, [hour, count]). For each key, I wish to keep only the value with the highest count, regardless of the hour. As soon as I have the RDD in the format above, I try to find the maximums by calling the following function in Spark:

reduceByKey(lambda x, y: max(x[1], y[1]))

But this throws the following error:

TypeError: 'int' object is not subscriptable

Which does not make any sense to me. I interpreted the arguments x and y as being the values of two keys, e.g. x=[13, 445] and y=[14, 109], but then the error does not make any sense. What am I doing wrong?

1条回答
太酷不给撩
2楼-- · 2019-04-17 16:13

Lets say you have [("key", [13,445]), ("key", [14,109]), ("key", [15,309])]

When this is passed to reduceByKey, it will group all the values with same key into one executor i.e. [13,445], [14,109], [15,309] and iterates among the values

In the first iterate x is [13,445] and y is [14,109] and the output is max(x[1], y[1]) i.e. max(445, 109) which is 445

In the second iterate x is 445 i.e. max of previous loop and y is [15,309]

Now when the second element of x is tried to be obtained by x[1] and 445 is just an integer, the error occurs

TypeError: 'int' object is not subscriptable

I hope the meaning of the error is clear. You can find more details in my other answer

The above explanation also explains why the solution proposed by @pault in the comments section works i.e.

reduceByKey(lambda x, y: (x[0], max(x[1], y[1])))
查看更多
登录 后发表回答