pyspark: add a new field to a data frame Row element

Posted 2020-07-14 09:08

I have the following element:

a = Row(ts=1465326926253, myid=u'1234567', mytype=u'good') 

The Row is of spark data frame Row class. Can I append a new field to a, so a would look like:

a = Row(ts=1465326926253, myid=u'1234567', mytype=u'good', name = u'john') 

Thanks!

2 Answers
家丑人穷心不美 · 2020-07-14 09:46

You cannot add a new field to an existing Row. Row is a subclass of tuple:

from pyspark.sql import Row

issubclass(Row, tuple)
## True

isinstance(Row(), tuple)
## True

and Python tuples are immutable. All you can do is create a new one:

row = Row(ts=1465326926253, myid=u'1234567', mytype=u'good') 

# In legacy Python: Row(name=u"john", **row.asDict())
Row(**row.asDict(), name=u"john") 
## Row(myid='1234567', mytype='good', name='john', ts=1465326926253)

Please note that Row keeps its fields sorted by name (this is Spark 2.x behavior; Spark 3.0 no longer sorts fields constructed from keyword arguments).
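Since Row is a tuple subclass, the same immutability can be demonstrated with the standard library alone, no Spark installation needed. The sketch below uses collections.namedtuple as a stand-in for Row to show the "build a new object" pattern:

```python
from collections import namedtuple

# namedtuple as a stand-in for pyspark.sql.Row
# (both are tuple subclasses, so both are immutable)
Person = namedtuple("Person", ["myid", "mytype"])
a = Person(myid="1234567", mytype="good")

# In-place assignment fails, exactly as it would on a Row
try:
    a.name = "john"
except AttributeError:
    pass  # tuples do not support attribute assignment

# The only option is to build a new object with the extra field
PersonWithName = namedtuple("PersonWithName", ["myid", "mytype", "name"])
b = PersonWithName(**a._asdict(), name="john")
print(b)
## PersonWithName(myid='1234567', mytype='good', name='john')
```

The `_asdict()` call here plays the same role as Row's `asDict()` in the answer above.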

狗以群分 · 2020-07-14 09:50
Here is an updated answer that works: first convert the Row to a dictionary, update the dict, then build a new Row from it.

Code is as follows:

from pyspark.sql import Row

# Create the pyspark Row
row = Row(field1=12345, field2=0.0123, field3=u'Last Field')

# Convert to a Python dict
temp = row.asDict()

# Modify the dict however you like, e.g. add a new field
temp["field4"] = "it worked!"

# Build a new Row from the updated dict
output = Row(**temp)

output
## Row(field1=12345, field2=0.0123, field3=u'Last Field', field4='it worked!')
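In Python 3.5+, the convert-update-rebuild steps can be collapsed into a single expression with `**` unpacking. A minimal stdlib sketch, using a plain dict to stand in for the result of `row.asDict()`:

```python
# Stand-in for row.asDict() -- a plain dict holding the original fields
original = {"field1": 12345, "field2": 0.0123, "field3": "Last Field"}

# Merge in the new field in one expression, without mutating the original
updated = {**original, "field4": "it worked!"}

print(updated["field4"])     # the new field is present
print("field4" in original)  # the source dict is untouched -> False
```

With a real Row, the same one-liner reads `Row(**{**row.asDict(), "field4": "it worked!"})`.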