I have a big DataFrame (~30M rows) and a function f. The job of f is to go through each row, check some logic, and feed the outputs into a dictionary. The function needs to be applied row by row.
I tried:
dic = dict()
for row in df.rdd.collect():
    f(row, dic)
But I always run into an OOM error, even though I set Docker's memory to 8 GB.
How can I perform this task efficiently?
Thanks a lot
By using collect you pull all the data out of the Spark executors onto your driver. You really should avoid this, as it makes using Spark pointless (you could just use plain Python in that case). What you could do instead:
reimplement your logic using functions already available in pyspark.sql.functions (see the docs); there is a sketch of this right after the list
if you cannot do the first because some functionality is missing, you can define a User Defined Function (UDF), as shown in the example further below
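Here is a minimal sketch of the first option. The column names (amount, status) and the condition are hypothetical stand-ins for the row-level logic in the question; the point is that the per-row check becomes a column expression and an aggregation replaces the dictionary, so nothing is ever collected onto the driver:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Small stand-in for the 30M-row DataFrame
df = spark.createDataFrame(
    [(1, 120.0, "open"), (2, 80.0, "closed"), (3, 200.0, "open")],
    ["id", "amount", "status"],
)

# Express the per-row check as a column instead of a Python loop
flagged = df.withColumn(
    "flag",
    F.when((F.col("amount") > 100) & (F.col("status") == "open"), "big_open")
     .otherwise("other"),
)

# Aggregate the outputs on the cluster instead of filling a dict on the driver
flagged.groupBy("flag").count().show()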
If the built-in functions are not enough, can you try something like the sketch below and let us know if it works for you? It wraps the row logic in a UDF; f and the columns are again hypothetical stand-ins for your actual logic:
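from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 120.0, "open"), (2, 80.0, "closed"), (3, 200.0, "open")],
    ["id", "amount", "status"],
)

def f(amount, status):
    # hypothetical stand-in for the question's row-level checks
    return "big_open" if amount > 100 and status == "open" else "other"

# Register the plain Python function as a UDF so it runs on the executors
f_udf = F.udf(f, StringType())

df.withColumn("flag", f_udf("amount", "status")).show()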
With the sample data above, the output is:
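+---+------+------+--------+
| id|amount|status|    flag|
+---+------+------+--------+
|  1| 120.0|  open|big_open|
|  2|  80.0|closed|   other|
|  3| 200.0|  open|big_open|
+---+------+------+--------+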
Hope this helps!