I have a column in my Spark DataFrame:
|-- topics_A: array (nullable = true)
| |-- element: string (containsNull = true)
I'm using CountVectorizer on it:
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
I get a NullPointerException, because the topics_A column sometimes contains null.
Is there a way around this? Filling it with a zero-length array would work OK (although it will blow out the data size quite a lot) - but I can't work out how to do a fillna on an array column in PySpark.
I had a similar issue; based on a comment, I used the following syntax to remove the null values before tokenizing.
Personally I would drop rows with NULL values, because there is no useful information in them, but you can also replace the nulls with empty arrays. First some imports are needed; then you can define an empty array of a specific type.
You can combine it with a when clause, or use coalesce, and then apply it with withColumn.
So, with example data containing a null row, the result will have an empty array in place of the null, and CountVectorizer will produce an empty vector for that row instead of throwing a NullPointerException.