Spark RDD: write to a global list

Published 2019-03-06 16:58

Question:

How to write in global list with rdd?

Li = []

def Fn(value):
    if value == 4:
        Li.append(1)

rdd.mapValues(lambda x: Fn(x))

When I print Li afterwards, the result is: []

What I'm trying to do is to update another global list, Li1, while transforming the RDD. However, I always end up with an empty list; Li1 is never modified.

Answer 1:

The reason Li is still [] after executing mapValues is that Spark serializes the function Fn (along with all the global variables it references; this is called its closure) and ships it to another machine, a worker.

But there is no corresponding mechanism for shipping closure state back from the workers to the driver, so any changes a worker makes to its copy of Li are simply lost.
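That said, when all you need from the workers is a small aggregated side result, such as a counter, Spark does provide a dedicated mechanism: accumulators. Workers can only add to an accumulator, and only the driver can read its value. A minimal sketch of your value == 4 check, where the sample data is my own assumption:

rdd = sc.parallelize([("a", 4), ("b", 3), ("c", 4)])

# Accumulators are write-only from the workers' point of view;
# only the driver can read the final value.
matches = sc.accumulator(0)

def count_fours(value):
    if value == 4:
        matches.add(1)

# foreach is an action, so the tasks actually run on the workers.
rdd.foreach(lambda kv: count_fours(kv[1]))

print(matches.value)  # 2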

In order to receive results, you need to return them from your function and use an action like take() or collect(). But be careful: you don't want to send back more data than fits in the driver's memory, otherwise the Spark application will throw an out-of-memory exception.
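For example, if the transformed dataset might be large, take() brings back only a bounded number of elements instead of everything; a small sketch with made-up data:

rdd = sc.parallelize(range(1000000))

# collect() would ship all one million results to the driver;
# take(5) fetches just enough partitions to produce 5 elements.
first_five = rdd.map(lambda x: x * 2).take(5)

print(first_five)  # [0, 2, 4, 6, 8]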

Also, you never executed an action on your mapValues transformation; RDD transformations are lazy, so in your example no tasks were ever run on the workers.

rdd = sc.parallelize([(x, x+1) for x in range(2, 5)])  # [(2, 3), (3, 4), (4, 5)]

def Fn(value):
    return value * 2

# collect() is an action: it triggers the computation
# and returns the results to the driver.
Li = rdd.mapValues(lambda x: Fn(x)).collect()

print(Li)

would result in

[(2, 6), (3, 8), (4, 10)]

Edit:

Following your problem description (based on my understanding of what you want to do):

L1 = range(20)
rdd = sc.parallelize(L1)

# filter() is a transformation; collect() is the action that
# runs it and brings the matching elements back to the driver.
L2 = rdd.filter(lambda x: x % 2 == 0).collect()

print(L2)
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
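And applying the same pattern to the exact condition from your question, you can build Li on the driver out of the returned result instead of appending to a global inside the closure. The sample pairs below are invented for illustration:

rdd = sc.parallelize([("a", 4), ("b", 3), ("c", 4), ("d", 5)])

# Keep the pairs whose value is 4, turn each match into a 1,
# and collect the resulting list back on the driver.
Li = rdd.filter(lambda kv: kv[1] == 4).map(lambda kv: 1).collect()

print(Li)  # [1, 1]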