I have a PySpark DataFrame with missing values:
from pyspark.sql import Row

tbl = sc.parallelize([
    Row(first_name='Alice', last_name='Cooper'),
    Row(first_name='Prince', last_name=None),
    Row(first_name=None, last_name='Lenon')
]).toDF()
tbl.show()
Here's the table:
+----------+---------+
|first_name|last_name|
+----------+---------+
| Alice| Cooper|
| Prince| null|
| null| Lenon|
+----------+---------+
I would like to create a new column as follows:
- if first name is None, take the last name
- if last name is None, take the first name
- if they are both present, concatenate them
- we can safely assume that at least one of them is present
I can construct a simple function:
def combine_data(row):
    if row.last_name is None:
        return row.first_name
    elif row.first_name is None:
        return row.last_name
    else:
        return '%s %s' % (row.first_name, row.last_name)

tbl.map(combine_data).collect()
This gives the correct result, but I can't append it to the table as a column. Calling
tbl.withColumn('new_col', tbl.map(combine_data))
fails with AssertionError: col should be Column.
What is the best way to convert the result of map to a Column? Is there a preferred way to deal with null values?
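For reference, here is the combination rule restated as a plain Python function (so the expected output for each case is explicit), together with my guess at how it might be wired into Spark as a UDF; the UDF part is commented out and unverified, not a confirmed solution:

```python
# Pure-Python sketch of the combination rule from the bullet list above.
def combine(first, last):
    """Return whichever name is present, or both joined by a space."""
    if last is None:
        return first
    if first is None:
        return last
    return '%s %s' % (first, last)

print(combine('Alice', 'Cooper'))  # Alice Cooper
print(combine('Prince', None))     # Prince
print(combine(None, 'Lenon'))      # Lenon

# My guess (untested) at attaching this as a column via a UDF:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# combine_udf = udf(combine, StringType())
# tbl = tbl.withColumn('new_col', combine_udf(tbl.first_name, tbl.last_name))
```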