I am attempting to call pyspark's reduceByKey function on data of the format (([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...). It seems pyspark will not accept an array as the key in a normal key/value reduction by simply applying .reduceByKey(add).
I have already tried converting the array to a string first, with .map(lambda (x, y): (str(x), y)), but this does not work because post-processing the strings back into arrays is too slow.
Is there a way I can make pyspark use the array as a key, or use another function to quickly convert the strings back into arrays?
Here is the associated error:
File "/home/jan/Documents/spark-1.4.0/python/lib/pyspark.zip/pyspark/shuffle.py", line 268, in mergeValues
d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'
SUMMARY:

    input:          x = [([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...]
    desired output: y = [([a,b,c], 2), ([a,d,b,e], 1), ...]

such that I could access a by y[0][0][0] and 2 by y[0][1].
Try this:
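A minimal sketch of the idea, assuming your pairs live in an RDD named rdd (the name is illustrative): convert each list key into a tuple before reducing, since tuples are hashable:

    from operator import add

    # Convert the unhashable list key into a hashable tuple, then reduce as usual.
    y = rdd.map(lambda kv: (tuple(kv[0]), kv[1])).reduceByKey(add).collect()

Tuples support the same indexing as lists, so y[0][0][0] and y[0][1] from your summary still work.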
Since Python lists are mutable, they cannot be hashed (they don't provide a __hash__ method):
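For example, hashing a list fails with exactly the TypeError from your traceback:

    >>> a_list = [1, 2, 3]
    >>> a_list.__hash__ is None
    True
    >>> hash(a_list)
    Traceback (most recent call last):
        ...
    TypeError: unhashable type: 'list'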
Tuples, on the other hand, are immutable and provide a __hash__ method implementation:
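A quick check:

    >>> a_tuple = (1, 2, 3)
    >>> a_tuple.__hash__ is None
    False
    >>> isinstance(hash(a_tuple), int)
    True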
hence they can be used as a key. Similarly, if you want to use unique values as a key, you should use frozenset:
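Again a quick check (frozenset is hashable, while set, being mutable, is not):

    >>> a_frozenset = frozenset([1, 2, 3])
    >>> isinstance(hash(a_frozenset), int)
    True
    >>> hash(set([1, 2, 3]))
    Traceback (most recent call last):
        ...
    TypeError: unhashable type: 'set'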
instead of set.