How to write to a global list from an RDD transformation?
Li = []

def fn(value):
    # intended: record a 1 every time a value equals 4
    if value == 4:
        Li.append(1)

rdd.mapValues(lambda x: fn(x))
When I try to print Li, the result is [].
What I'm trying to do is update another global list, Li1, while transforming the RDD. However, whenever I do this I end up with an empty list; Li1 is never modified.
The reason you get Li set to [] after executing mapValues is that Spark serializes the fn function (together with every global variable it references; this bundle is called the closure) and ships it to another machine, a worker. But there is no corresponding mechanism for sending the closure's state back from a worker to the driver.
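The one exception is Spark's accumulators, which exist precisely so workers can feed updates back to the driver. A minimal sketch if that is what you really need (the names are my own; also note Spark only guarantees exactly-once accumulator updates inside actions, so an update made in a transformation can be double-counted if a task is retried):

acc = sc.accumulator(0)

def count_fours(value):
    # Runs on a worker; the update is merged back into the
    # driver's copy of the accumulator when the task finishes.
    if value == 4:
        acc.add(1)
    return value

rdd = sc.parallelize([(x, x + 1) for x in range(2, 5)])
rdd.mapValues(count_fours).collect()  # an action is still required to run the tasks
print(acc.value)  # 1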
To get results back, you need to return them from your function and use an action such as take() or collect(). But be careful: you don't want to send back more data than fits in the driver's memory, otherwise the Spark application will throw an out-of-memory exception.
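For example, here is a sketch of the safer pattern with take() (the RDD is invented for illustration):

big = sc.parallelize(range(1000000))

# take(n) ships at most n elements to the driver, so previewing
# a large RDD cannot exhaust the driver's memory.
preview = big.map(lambda x: x * 2).take(5)
print(preview)  # [0, 2, 4, 6, 8]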
Also, you never executed an action on your mapValues transformation, so in your example no tasks were run on the workers at all: transformations are lazy and only describe the computation.
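To make that laziness concrete, a quick sketch (the pair RDD is made up for illustration):

rdd = sc.parallelize([(x, x + 1) for x in range(2, 5)])

# A transformation only records the computation; nothing runs yet.
mapped = rdd.mapValues(lambda v: v * 2)

# An action is what actually schedules tasks on the workers.
print(mapped.count())  # 3

Putting both points together, returning values and collecting them works: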
rdd = sc.parallelize([(x, x + 1) for x in range(2, 5)])

def fn(value):
    return value * 2

Li = rdd.mapValues(lambda x: fn(x)).collect()
print(Li)
would result in
[(2, 6), (3, 8), (4, 10)]
Edit:
Following your problem description (based on my understanding of what you want to do):
L1 = range(20)
rdd = sc.parallelize(L1)
L2 = rdd.filter(lambda x: x % 2 == 0).collect()
print(L2)
>>> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
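And if the goal is literally what your original fn did (record a 1 whenever a value equals 4), the same return-and-collect pattern applies. A sketch using the pair RDD from the earlier example:

rdd = sc.parallelize([(x, x + 1) for x in range(2, 5)])

# Build Li on the driver from what the workers return,
# instead of mutating a global inside them.
Li = rdd.values().filter(lambda v: v == 4).map(lambda v: 1).collect()
print(Li)  # [1]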