I want to run the Random Forest algorithm in PySpark. The PySpark documentation mentions that VectorAssembler accepts only numerical or boolean data types. So, if my data contains StringType variables, say names of cities, should I one-hot encode them in order to proceed with Random Forest classification/regression?
Here is the code I have been trying (the input file is here):
train = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('filename')
drop_list = ["Country", "Carrier", "TrafficType", "Device", "Browser", "OS", "Fraud", "ConversionPayOut"]
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler
# Only this variable is actually double; the rest of them are strings.
train = train.withColumn("ConversionPayOut", train["ConversionPayOut"].cast("double"))
junk = train.select([column for column in train.columns if column in drop_list])
# Assumed definition: the assembler was not shown in the original post.
assembler = VectorAssembler(inputCols=junk.columns, outputCol="features")
transformed = assembler.transform(junk)
I keep getting the error IllegalArgumentException: u'Data type StringType is not supported.'
P.S.: Apologies for asking a basic question. I come from an R background. In R, when we run Random Forests, there is no need to convert categorical variables into numeric variables.
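Yes, you should use StringIndexer, possibly together with OneHotEncoder. StringIndexer maps each distinct string to a numeric index, which VectorAssembler accepts; OneHotEncoder then turns those indices into binary vectors if you do not want the model to treat them as ordered values. You can find more information on these two in the linked documentation.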