I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.
See my attempt below, which results in an error.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1),
                                     Row(col1='', col2=2),
                                     Row(col1=None, col2=None)])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |    |   2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## +----+----+
Simply adding on top of zero323's and soulmachine's answers: to convert all StringType fields at once, read the string column names out of the schema and apply the conversion only to those.
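Something along these lines (a sketch; string_fields is just a local name, and it reuses the same when/otherwise trick as the answers below):

from pyspark.sql.functions import col, when
from pyspark.sql.types import StringType

## Pull the names of all StringType columns out of the schema
string_fields = [f.name for f in testDF.schema.fields
                 if isinstance(f.dataType, StringType)]

## Rewrite only the string columns; pass the rest through unchanged
exprs = [when(col(f) != '', col(f)).otherwise(None).alias(f)
         if f in string_fields else col(f)
         for f in testDF.columns]
testDF.select(*exprs).show()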
This is a different version of soulmachine's solution (the substitution wrapped in a UDF instead of a built-in expression), but I don't think you can translate it to Python quite as easily:
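A rough Python approximation of the idea, as a plain lambda UDF (empty_value_substitution is an illustrative name, not from the original):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

## Map '' to None and pass every other value through untouched
empty_value_substitution = udf(lambda s: None if s == '' else s, StringType())

testDF.withColumn('col1', empty_value_substitution(testDF['col1'])).show()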
My solution is more general than the others I've seen so far: it can deal with as many fields as you want. The idea is to walk the DataFrame's schema and rewrite only the StringType columns, leaving everything else untouched. The original is a small Scala function, but you can easily rewrite it in Python.
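A sketch of that Python rewrite (set_empty_to_null is an illustrative name):

from pyspark.sql.functions import col, length, when
from pyspark.sql.types import StringType

def set_empty_to_null(df):
    ## Rewrite every StringType column so '' becomes null;
    ## all other columns pass through unchanged.
    exprs = [when(length(col(f.name)) == 0, None).otherwise(col(f.name)).alias(f.name)
             if isinstance(f.dataType, StringType) else col(f.name)
             for f in df.schema.fields]
    return df.select(*exprs)

set_empty_to_null(testDF).show()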
I learned this trick from @liancheng.
UDFs are not terribly efficient. The correct way to do this using a built-in method is:
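Something like this, per column (colName stands in for whichever column you are fixing):

from pyspark.sql.functions import col, when

colName = 'col1'  ## whichever column you want to fix
testDF = testDF.withColumn(
    colName, when(col(colName) == '', None).otherwise(col(colName)))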
It is as simple as this:
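A sketch of the when/otherwise approach (blank_as_null is just a helper name):

from pyspark.sql.functions import col, when

def blank_as_null(x):
    ## Keep the value when it is a non-empty string, otherwise emit null
    return when(col(x) != '', col(x)).otherwise(None)

dfWithEmptyReplaced = testDF.withColumn('col1', blank_as_null('col1'))
dfWithEmptyReplaced.na.drop().show()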
If you want to fill multiple columns you can, for example, reduce:
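For example (to_convert is whatever set of column names you need; on Python 3, reduce lives in functools):

from functools import reduce  ## a builtin on Python 2

to_convert = {'col1'}  ## some set of columns to convert

reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)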
or use a comprehension:
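Reusing blank_as_null and to_convert from above:

exprs = [blank_as_null(x).alias(x) if x in to_convert else x
         for x in testDF.columns]

testDF.select(*exprs)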
If you want to operate specifically on string fields, please check the answer by robin-loxley.