Spark: Mapping elements of an RDD using the other elements

Posted 2019-09-10 23:30

Suppose I have this RDD:

val r = sc.parallelize(Array(1,4,2,3))

What I want to do is create a mapping, e.g.:

r.map(x => x + func(all other elements of r))

Is this even possible?

3 Answers
戒情不戒烟
Answer 2 · 2019-09-10 23:49

Spark already supports gradient descent in MLlib. Maybe you can take a look at how they implemented it.
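For flavor, here is a rough sketch of the loop that pattern boils down to: broadcast the current state, aggregate a statistic over the whole RDD on the executors, update on the driver, repeat. This is not Spark's actual MLlib code; the data and the gradient formula are illustrative.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
points = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)])  # illustrative (x, y) pairs

w = 0.0                      # model parameter lives on the driver
for _ in range(10):
    w_b = sc.broadcast(w)    # ship the current state to the executors
    # each element contributes its gradient term; reduce sums them
    g = points.map(lambda p: 2 * (w_b.value * p[0] - p[1]) * p[0]) \
              .reduce(lambda a, b: a + b)
    w -= 0.01 * g            # update on the driver, then iterate
print(w)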

家丑人穷心不美
Answer 3 · 2019-09-11 00:04

I don't know if there is a more efficient alternative, but I would first create a structure like this:

rdd = sc.parallelize([(1, [4, 2, 3]), (4, [1, 2, 3]), (2, [1, 4, 3]), (3, [1, 4, 2])])
rdd = rdd.map(lambda t: t[0] + func(t[1]))  # t is (element, [all other elements])
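If you'd rather build those (element, all-the-others) pairs programmatically than by hand, one possible sketch, assuming func is sum and that an O(n^2) cartesian product is acceptable for your data size:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
r = sc.parallelize([1, 4, 2, 3])

indexed = r.zipWithIndex().map(lambda t: (t[1], t[0]))   # (idx, value); idx keeps duplicate values apart
pairs = (indexed.cartesian(indexed)                      # every (self, other) combination
                .filter(lambda t: t[0][0] != t[1][0])    # drop the pairing of an element with itself
                .map(lambda t: (t[0], t[1][1]))          # ((idx, value), other_value)
                .groupByKey()
                .map(lambda t: (t[0][1], list(t[1]))))   # (value, [all other values])

result = pairs.map(lambda t: t[0] + sum(t[1]))           # func == sum here, so each result is the grand total
print(result.collect())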
爷、活的狠高调
Answer 4 · 2019-09-11 00:06

If you try this naively, it's very likely that you will get an exception, e.g. the one below.

rdd = sc.parallelize(range(100))
rdd = rdd.map(lambda x: x + sum(rdd.collect()))  # invalid: references rdd inside its own transformation

This fails because you are, in effect, trying to reference (broadcast) the RDD from inside one of its own transformations:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

To achieve this, compute the aggregate on the driver first and broadcast it:

res = sc.broadcast(rdd.reduce(lambda a, b: a + b))  # compute the sum on the driver, then broadcast it
rdd = rdd.map(lambda x: x + res.value)              # executors read the broadcast value
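One caveat: res.value is the sum of all elements including x itself. If func really must see only the other elements, and func is an aggregate like sum, you can subtract x back out. A minimal variant under the same assumptions, replacing the two lines above:

total = sc.broadcast(rdd.reduce(lambda a, b: a + b))  # sum over the original rdd
rdd = rdd.map(lambda x: x + (total.value - x))        # total.value - x == sum of all other elements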