pyspark: add a new field to a data frame Row element

Posted 2020-07-14 09:08

I have the following element:

a = Row(ts=1465326926253, myid=u'1234567', mytype=u'good') 

The Row is of spark data frame Row class. Can I append a new field to a, so a would look like:

a = Row(ts=1465326926253, myid=u'1234567', mytype=u'good', name = u'john') 

Thanks!

2 Answers
家丑人穷心不美 · 2020-07-14 09:46

You cannot add a new field to an existing Row. Row is a subclass of tuple:

from pyspark.sql import Row

issubclass(Row, tuple)
## True

isinstance(Row(), tuple)
## True

and Python tuples are immutable. All you can do is create a new one:

row = Row(ts=1465326926253, myid=u'1234567', mytype=u'good') 

# In legacy Python: Row(name=u"john", **row.asDict())
Row(**row.asDict(), name=u"john") 
## Row(myid='1234567', mytype='good', name='john', ts=1465326926253)

Please note that Row keeps its fields sorted by name (this is Spark 2.x behavior; Spark 3.0 no longer sorts fields constructed from keyword arguments).
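Since Row is a tuple subclass, the same immutability can be demonstrated with the standard library alone, no Spark installation needed. The sketch below uses collections.namedtuple as a stand-in for Row to show the "build a new object" pattern:

```python
from collections import namedtuple

# namedtuple as a stand-in for pyspark.sql.Row
# (both are tuple subclasses, so both are immutable)
Person = namedtuple("Person", ["myid", "mytype"])
a = Person(myid="1234567", mytype="good")

# In-place assignment fails, exactly as it would on a Row
try:
    a.name = "john"
except AttributeError:
    pass  # tuples do not support attribute assignment

# The only option is to build a new object with the extra field
PersonWithName = namedtuple("PersonWithName", ["myid", "mytype", "name"])
b = PersonWithName(**a._asdict(), name="john")
print(b)
## PersonWithName(myid='1234567', mytype='good', name='john')
```

The `_asdict()` call here plays the same role as Row's `asDict()` in the answer above.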

狗以群分 · 2020-07-14 09:50
Here is an updated answer that works: first convert the Row to a dictionary, update the dict, then build a new Row from it.

Code is as follows:

from pyspark.sql import Row

# Create the pyspark Row
row = Row(field1=12345, field2=0.0123, field3=u'Last Field')

# Convert to a Python dict
temp = row.asDict()

# Modify the dict however you like, e.g. add a new field
temp["field4"] = "it worked!"

# Build a new Row from the updated dict
output = Row(**temp)

output
## Row(field1=12345, field2=0.0123, field3=u'Last Field', field4='it worked!')
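In Python 3.5+, the convert-update-rebuild steps can be collapsed into a single expression with `**` unpacking. A minimal stdlib sketch, using a plain dict to stand in for the result of `row.asDict()`:

```python
# Stand-in for row.asDict() -- a plain dict holding the original fields
original = {"field1": 12345, "field2": 0.0123, "field3": "Last Field"}

# Merge in the new field in one expression, without mutating the original
updated = {**original, "field4": "it worked!"}

print(updated["field4"])     # the new field is present
print("field4" in original)  # the source dict is untouched -> False
```

With a real Row, the same one-liner reads `Row(**{**row.asDict(), "field4": "it worked!"})`.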