Assuming that I have the following RDDs:
a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])
How do I combine these two RDDs into one RDD that looks like this:
[('a', 1), ('c', 2), ('d', 5), ('e', 3)]
Using a.union(b) just concatenates them into a single list. Any ideas?
You probably just want to zip the two RDDs with b.zip(a) (note the reversed order, since you want to key by b's values).
Just read the Python docs carefully:
zip(other)
Zips this RDD with another one, returning key-value pairs with the
first element in each RDD, second element in each RDD, etc. Assumes
that the two RDDs have the same number of partitions and the same
number of elements in each partition (e.g. one was made through a map
on the other).
x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
x.zip(y).collect()
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
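The partition requirement in the docs quoted above is easy to overlook, so here is a minimal plain-Python sketch of it. This is only a model, not Spark's implementation: partitions are represented as a list of lists, and split_into_partitions and zip_partitions are hypothetical helpers invented for illustration. The contiguous slicing used here is an assumption about how the data is laid out.

```python
def split_into_partitions(items, num_partitions):
    """Hypothetical helper: cut a list into contiguous chunks (assumed layout)."""
    n = len(items)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def zip_partitions(left, right):
    """Hypothetical model of zip: pair elements partition by partition."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    pairs = []
    for left_part, right_part in zip(left, right):
        if len(left_part) != len(right_part):
            raise ValueError("different number of elements in a partition")
        pairs.extend(zip(left_part, right_part))
    return pairs


# The data from the question, split into 2 partitions each.
a_parts = split_into_partitions([1, 2, 5, 3], 2)
b_parts = split_into_partitions(['a', 'c', 'd', 'e'], 2)

# Same shape on both sides, so pairing works; b goes first to act as the key.
result = zip_partitions(b_parts, a_parts)
print(result)

# Dropping one element changes the per-partition counts, so pairing fails.
shorter_parts = split_into_partitions([1, 2, 3], 2)
try:
    zip_partitions(b_parts, shorter_parts)
except ValueError as err:
    print("zip failed:", err)
```

In this model the two inputs from the question line up because both were split the same way. Once one side has been filtered or repartitioned independently, the shapes can differ and the pairing is no longer valid, which is the situation the docs warn about.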