I have an RDD of (key, value) elements. The keys are NumPy arrays. NumPy arrays are not hashable, and this causes a problem when I try to do a reduceByKey operation.
Is there a way to supply the Spark context with my manual hash function? Or is there any other way around this problem (other than actually hashing the arrays "offline" and passing to Spark just the hashed key)?
Here is an example:
import numpy as np
from pyspark import SparkContext
sc = SparkContext()
data = np.array([[1,2,3],[4,5,6],[1,2,3],[4,5,6]])
rd = sc.parallelize(data).map(lambda x: (x,np.sum(x))).reduceByKey(lambda x,y: x+y)
rd.collect()
The error is:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
...
TypeError: unhashable type: 'numpy.ndarray'
The simplest solution is to convert the key to an object that is hashable. For example:
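A minimal sketch, reusing sc and data from the question and using tuple() as the hashable stand-in (any immutable representation of the array would do):

from operator import add

# tuple(x) is hashable, so reduceByKey can group on it
rd = sc.parallelize(data).map(lambda x: (tuple(x), np.sum(x))).reduceByKey(add)
rd.collect()
# each distinct row appears twice, so its row sum is doubled,
# e.g. [((1, 2, 3), 12), ((4, 5, 6), 30)] (ordering may vary)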
You can convert the key back to a NumPy array later if needed (np.array(key)).
Not a straightforward one. The whole mechanism depends on the fact that an object implements a __hash__ method, and C extensions cannot be monkey-patched. You could try to use dispatching to override pyspark.rdd.portable_hash, but I doubt it is worth it even if you consider the cost of the conversions.
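For reference, a rough sketch of what that dispatching could look like with functools.singledispatch (purely illustrative; patched_hash is my own name and I have not tested this across Spark versions):

import numpy as np
from functools import singledispatch
import pyspark.rdd

# Wrap the original portable_hash in a generic function and register an
# ndarray-specific implementation that hashes a tuple of the values.
patched_hash = singledispatch(pyspark.rdd.portable_hash)

@patched_hash.register(np.ndarray)
def _(value):
    # Delegate to the original portable_hash on a hashable tuple.
    return patched_hash.registry[object](tuple(value))

pyspark.rdd.portable_hash = patched_hash

Even then, the patch may not reach every code path: the default partitionFunc argument is bound when pyspark.rdd is imported, so reassigning the module attribute afterwards may not affect it, and the keys likely still pass through ordinary Python dicts during local combining, where they have to be hashable anyway.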