Combine two RDDs in pyspark

Published 2019-07-16 13:03

Question:

Assuming that I have the following RDDs:

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])

How do I combine these two RDDs into a single RDD that looks like this:

[('a', 1), ('c', 2), ('d', 5), ('e', 3)]

Using a.union(b) just concatenates the two RDDs; it doesn't pair the elements. Any ideas?

Answer 1:

You probably just want to zip the two RDDs with b.zip(a) (note the reversed order, since you want to key by b's values).

Just read the Python docs carefully:

zip(other)

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

>>> x = sc.parallelize(range(0, 5))
>>> y = sc.parallelize(range(1000, 1005))
>>> x.zip(y).collect()
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
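
Applied to the RDDs from the question, this gives exactly the pairs you asked for. A minimal self-contained sketch (assuming a local SparkContext; in the pyspark shell, sc already exists and the import/setup lines can be skipped):

from pyspark import SparkContext

sc = SparkContext("local", "zip_example")  # assumed local setup for illustration

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a', 'c', 'd', 'e'])

# zip pairs elements by position; putting b first makes its values the keys
print(b.zip(a).collect())
# [('a', 1), ('c', 2), ('d', 5), ('e', 3)]

This works here because both RDDs were created by parallelize with the same number of elements and the same default partitioning, so zip's same-partitions/same-counts assumption holds.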