How can I cross-combine (is this the correct way to describe it?) two RDDs?
input:
rdd1 = [a, b]
rdd2 = [c, d]
output:
rdd3 = [(a, c), (a, d), (b, c), (b, d)]
I tried rdd3 = rdd1.flatMap(lambda x: rdd2.map(lambda y: (x, y))), but it complains: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation." I guess that means you cannot nest actions as you would in a list comprehension, and one statement can only perform one action.
As you have noticed, you can't perform a transformation inside another transformation (note that flatMap and map are transformations rather than actions, since they return RDDs). Thankfully, what you're trying to accomplish is directly supported by another transformation in the Spark API, namely cartesian (see http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD ).
So you would want to do rdd1.cartesian(rdd2).
You can use the cartesian transformation. Here's an example from the documentation:
>>> rdd = sc.parallelize([1,2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]
In your case, you would do:
rdd3 = rdd1.cartesian(rdd2)
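If you want to check which pairs to expect without starting Spark, here is a minimal plain-Python sketch of the same pairing. It is only an analogue: left and right are hypothetical local lists standing in for the contents of rdd1 and rdd2, and itertools.product is the standard-library function, not part of the Spark API.

```python
from itertools import product

# Hypothetical local stand-ins for the contents of rdd1 and rdd2
# (plain Python lists, not RDDs).
left = ["a", "b"]
right = ["c", "d"]

# product pairs every element of left with every element of right,
# which is the same set of pairs rdd1.cartesian(rdd2) would contain.
pairs = list(product(left, right))
print(pairs)  # [('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')]
```

The list above comes out in a fixed order, but the order you get back from collect() on an RDD may depend on partitioning, which is presumably why the documentation example wraps the result in sorted().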