I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When I apply the OneHotEncoder, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way I can get around this?
I could reproduce the error with the example provided on the Spark ML page:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""), // <- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.show()
It is annoying since missing/empty values are a very common case.
Thanks in advance, Nikhil
Yep, it's a little thorny, but maybe you can just replace the empty string with something sure to be different from the other values. NOTE that I am using the pyspark DataFrameNaFunctions API, but Scala's should be similar.
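Since Scala's API is indeed similar, a rough Scala sketch of that replacement using DataFrameNaFunctions.replace might look like this (the "EMPTY" placeholder is an arbitrary choice; use any value that cannot collide with a real category):

// Swap empty strings in the "category" column for a placeholder
// before running StringIndexer/OneHotEncoder.
val dfFilled = df.na.replace("category", Map("" -> "EMPTY"))
dfFilled.show()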
If the column contains null, the OneHotEncoder fails with a NullPointerException, so I extended the UDF to translate null values as well.
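A sketch of such an extended UDF (the "NA" placeholder and column names are illustrative choices):

import org.apache.spark.sql.functions.{col, udf}

// Map both null and empty strings to a placeholder category so that
// neither crashes StringIndexer/OneHotEncoder downstream.
val fillCategory = udf { (s: String) =>
  if (s == null || s.isEmpty) "NA" else s
}

val dfNoNulls = df.withColumn("category", fillCategory(col("category")))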
Since the OneHotEncoder/OneHotEncoderEstimator does not accept an empty string as a name (otherwise you'll get the requirement failed: Cannot have an empty string for name error quoted above), this is how I would do it. (There is another way to do it, cf. @Anthony's answer.)
I'll create a UDF to process the empty category, then apply the UDF to the column:
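A minimal sketch of those two steps (the "NA" placeholder and the processedCategory column name are illustrative choices):

import org.apache.spark.sql.functions.{col, udf}

// Step 1: UDF that maps the empty category to a placeholder value.
val processEmpty = udf { (s: String) =>
  if (s.isEmpty) "NA" else s
}

// Step 2: apply the UDF to the category column.
val dfProcessed = df.withColumn("processedCategory", processEmpty(col("category")))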
Now, you can go back to your transformations:
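That is, the StringIndexer and OneHotEncoder now run on the cleaned column; a sketch continuing with the dfProcessed / processedCategory names assumed above:

val indexer = new StringIndexer()
  .setInputCol("processedCategory")
  .setOutputCol("categoryIndex")
  .fit(dfProcessed)

val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .transform(indexer.transform(dfProcessed))

encoded.show()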
EDIT:
@Anthony's solution in Scala:
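A minimal sketch of that translation, using DataFrameNaFunctions.replace (again, "NA" is an illustrative placeholder):

// Replace empty strings directly via DataFrameNaFunctions, no UDF needed.
df.na.replace("category", Map("" -> "NA")).show()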
I hope this helps!