I have a big DataFrame (~30M rows) and a function f. The job of f is to go through each row, check some logic, and feed the outputs into a dictionary. The function needs to be applied row by row.
I tried:
dic = dict()
for row in df.rdd.collect():
    f(row, dic)
But I always run into an OOM error, even though I set Docker's memory to 8 GB.
How can I perform this task efficiently?
Thanks a lot
By using collect you pull all the data out of the Spark executors onto your driver. You really should avoid this, as it makes using Spark pointless (you could just use plain Python in that case). What you could do instead:
reimplement your logic using functions already available in pyspark.sql.functions (see the docs); there is a sketch of this right after the list
if you cannot do the first because some functionality is missing, you can define a User Defined Function (UDF), as shown in the example further below
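Here is a minimal sketch of the first option. The column names (amount, status) and the condition are hypothetical stand-ins for the row-level logic in the question; the point is that the per-row check becomes a column expression and an aggregation replaces the dictionary, so nothing is ever collected onto the driver:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Small stand-in for the 30M-row DataFrame
df = spark.createDataFrame(
    [(1, 120.0, "open"), (2, 80.0, "closed"), (3, 200.0, "open")],
    ["id", "amount", "status"],
)

# Express the per-row check as a column instead of a Python loop
flagged = df.withColumn(
    "flag",
    F.when((F.col("amount") > 100) & (F.col("status") == "open"), "big_open")
     .otherwise("other"),
)

# Aggregate the outputs on the cluster instead of filling a dict on the driver
flagged.groupBy("flag").count().show()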
If the built-in functions are not enough, can you try something like the sketch below and let us know if it works for you? It wraps the row logic in a UDF; f and the columns are again hypothetical stand-ins for your actual logic:
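from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 120.0, "open"), (2, 80.0, "closed"), (3, 200.0, "open")],
    ["id", "amount", "status"],
)

def f(amount, status):
    # hypothetical stand-in for the question's row-level checks
    return "big_open" if amount > 100 and status == "open" else "other"

# Register the plain Python function as a UDF so it runs on the executors
f_udf = F.udf(f, StringType())

df.withColumn("flag", f_udf("amount", "status")).show()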
With the sample data above, the output is:
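+---+------+------+--------+
| id|amount|status|    flag|
+---+------+------+--------+
|  1| 120.0|  open|big_open|
|  2|  80.0|closed|   other|
|  3| 200.0|  open|big_open|
+---+------+------+--------+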
Hope this helps!