Combine two RDDs in PySpark

Posted 2019-07-16 13:09

Assuming that I have the following RDDs:

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])

How do I combine these two RDDs into one RDD that looks like this:

[('a', 1), ('c', 2), ('d', 5), ('e', 3)]

Using a.union(b) just concatenates them into a single list. Any ideas?

1 Answer
叛逆
#2 · 2019-07-16 13:42

You probably just want to zip the two RDDs with b.zip(a) (note the reversed order, since you want to key by b's values).
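Applied to the RDDs from your question, a minimal sketch (assuming sc is an active SparkContext; both RDDs were built by a single parallelize call over equal-length lists, so the same-partitioning assumption quoted below holds):

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a', 'c', 'd', 'e'])

# b's elements become the keys, a's elements the values
b.zip(a).collect()
# [('a', 1), ('c', 2), ('d', 5), ('e', 3)]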

Just read the Python docs carefully:

zip(other)

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

x = sc.parallelize(range(0, 5))
y = sc.parallelize(range(1000, 1005))
x.zip(y).collect()
# [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]