I have an IPython notebook with PySpark code. It works fine on my machine, but when I try to run it on a different machine it throws an error at this line (the rdd3 line):
rdd2 = sc.parallelize(list1)
rdd3 = rdd1.zip(rdd2).map(lambda ((x1,x2,x3,x4), y): (y,x2, x3, x4))
list = rdd3.collect()
The error I get is:
ValueError Traceback (most recent call last)
<ipython-input-7-9daab52fc089> in <module>()
---> 16 rdd3 = rdd1.zip(rdd2).map(lambda ((x1,x2,x3,x4), y): (y,x2, x3, x4))
/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in zip(self, other)
1960
1961 if self.getNumPartitions() != other.getNumPartitions():
-> 1962 raise ValueError("Can only zip with RDD which has the same number of partitions")
1963
1964 # There will be an Exception in JVM if there are different number
ValueError: Can only zip with RDD which has the same number of partitions

I don't understand why this error occurs on one machine but not on the other.
zip is, generally speaking, a tricky operation. It requires both RDDs not only to have the same number of partitions, but also the same number of elements per partition.

Excluding some special cases, this is guaranteed only if both RDDs have the same ancestor and there are no shuffles or operations that could change the number of elements (filter, flatMap) between the common ancestor and the current state. Typically this means only map (1-to-1) transformations.
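In this question the mismatch most likely comes from sc.parallelize itself: when numSlices is not given, it falls back to sc.defaultParallelism, which depends on the number of cores and the Spark configuration of the machine, so the same notebook can produce RDDs with different partition counts on different machines. A minimal check, assuming the rdd1 and rdd2 from the question:

# Compare the partition counts; zip raises the ValueError whenever these differ.
print(rdd1.getNumPartitions())
print(rdd2.getNumPartitions())
print(sc.defaultParallelism)

# Creating rdd2 with an explicit partition count can remove the mismatch,
# but only if the element counts per partition also end up identical.
rdd2 = sc.parallelize(list1, rdd1.getNumPartitions())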
If you know that the order is otherwise preserved, but the number of partitions or the number of elements per partition differ, you can use join with indices:
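A minimal PySpark sketch of that approach (the helper name zip_by_index is just illustrative, not an existing API):

def zip_by_index(left, right):
    # Tag every element with its position, then key each record by that index.
    left_by_index = left.zipWithIndex().map(lambda vi: (vi[1], vi[0]))
    right_by_index = right.zipWithIndex().map(lambda vi: (vi[1], vi[0]))
    # join matches records with equal indices regardless of partitioning;
    # sortByKey restores the original order and values() drops the index again.
    return left_by_index.join(right_by_index).sortByKey().values()

Applied to the code from the question it would look roughly like this:

rdd3 = zip_by_index(rdd1, rdd2).map(lambda p: (p[1], p[0][1], p[0][2], p[0][3]))
result = rdd3.collect()

Keep in mind that zipWithIndex has to trigger a job to count the elements per partition when the RDD has more than one partition, so this is more expensive than a plain zip.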
A Scala equivalent would follow the same pattern, since zipWithIndex and join are available on the Scala RDD API as well.