VectorAssembler does not support the StringType type

Posted 2019-02-17 19:08

Question:

I have a DataFrame that contains string columns, and I plan to use it as input for k-means with Spark and Scala. I convert the string-typed columns of the DataFrame using the method below:

val toDouble = udf[Double, String](_.toDouble)
val analysisData = dataframe_mysql
  .withColumn("Event", toDouble(dataframe_mysql("event")))
  .withColumn("Execution", toDouble(dataframe_mysql("execution")))
  .withColumn("Info", toDouble(dataframe_mysql("info")))
val assembler = new VectorAssembler()
  .setInputCols(Array("execution", "event", "info"))
  .setOutputCol("features")
val output = assembler.transform(analysisData)
println(output.select("features", "execution").first())

When I print the analysisData schema, the conversion looks correct, but I get an exception: VectorAssembler does not support the StringType type. That suggests my values are still strings. How can I convert the values and not only the schema type?

Thanks

Answer 1:

Indeed, the VectorAssembler transformer does not accept strings, so you need to make sure that its input columns are of numeric, boolean, or vector type. Check that your udf is doing the right thing and that none of the input columns is still StringType.
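One way to check this before calling the assembler is to inspect the schema and list the input columns that are still StringType. The snippet below is only a sketch: the local SparkSession and the small `df` are hypothetical stand-ins for the real `dataframe_mysql`, which is not shown in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StringType

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[1]").appName("schema-check").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for dataframe_mysql: numeric values stored as strings.
val df = Seq(("1", "2.5", "3"), ("4", "5.0", "6")).toDF("event", "execution", "info")

// Columns that will be handed to VectorAssembler.
val inputCols = Array("execution", "event", "info")

// Names of the assembler inputs whose type is still StringType, in schema order.
val stringCols = df.schema.fields
  .filter(f => inputCols.contains(f.name) && f.dataType == StringType)
  .map(_.name)

println(s"Still StringType: ${stringCols.mkString(", ")}")
```

If the list is not empty, those columns need converting before `assembler.transform` is called. `df.printSchema()` gives the same information in a form that is easier to read by eye.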

To convert a column in a Spark DataFrame to another type, keep it simple and use the cast() DSL function, like so:

import org.apache.spark.sql.types.DoubleType
val analysisData = dataframe_mysql.withColumn("Event", dataframe_mysql("Event").cast(DoubleType))

It should work!
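Applied to all three columns from the question, the cast approach might look like the sketch below. It assumes the columns are named exactly "execution", "event" and "info" and hold strings that parse as numbers. The local SparkSession and the `raw` DataFrame are hypothetical stand-ins for `dataframe_mysql`.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[1]").appName("cast-then-assemble").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for dataframe_mysql: numeric values stored as strings.
val raw = Seq(("1", "2.5", "3"), ("4", "5.0", "6")).toDF("event", "execution", "info")

val featureCols = Array("execution", "event", "info")

// Overwrite each column with its Double cast, keeping the original column name,
// so the names given to the assembler refer to the converted columns.
val analysisData = featureCols.foldLeft(raw) { (df, name) =>
  df.withColumn(name, col(name).cast(DoubleType))
}

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val output = assembler.transform(analysisData)
output.select("features", "execution").show(false)
```

Reusing the same name in `withColumn` replaces the string column, so no string version is left under a similar name. A string that cannot be parsed as a number either becomes null or raises an error when cast, depending on the Spark version and ANSI settings, so it may be worth checking for nulls before assembling.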