I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset which contains categorical variables. I discovered that Spark is not able to work with that kind of variable.
In R there is a simple way to deal with this kind of problem: I convert the variable to a factor (categories), and R creates a set of columns coded as {0,1} indicator variables.
How can I perform this with Spark?
If the categories can fit in the driver memory, here is my suggestion:
A VectorIndexer is coming in Spark 1.4 which might help you with this kind of feature transformation: http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/api/scala/index.html#org.apache.spark.ml.feature.VectorIndexer
However, it looks like this will only be available in spark.ml rather than in mllib:
https://issues.apache.org/jira/browse/SPARK-4081
If I understood correctly, you do not want to convert one categorical column into several dummy columns; you want Spark to understand that the column is categorical, not numerical.
I think it depends on the algorithm you want to use right now. For example, RandomForest and GBT both take categoricalFeaturesInfo as a parameter; check it here:
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$
so for example:
val categoricalFeaturesInfo = Map[Int, Int](1 -> 2, 2 -> 5)
is actually saying that the second column of your features (indices start at 0, so index 1 is the second column) is categorical with 2 levels, and the third is also a categorical feature, with 5 levels. You can specify these parameters when you train your RandomForest or GBT.
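As a rough sketch, a training call could look like the following (the variable names and hyperparameter values here are illustrative assumptions, not your actual setup):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

// training: RDD[LabeledPoint] whose categorical features are already
// mapped to 0.0, 1.0, 2.0, ...
def trainWithCategoricals(training: RDD[LabeledPoint]) = {
  // feature 1 has 2 levels, feature 2 has 5 levels; all others are numerical
  val categoricalFeaturesInfo = Map[Int, Int](1 -> 2, 2 -> 5)
  RandomForest.trainClassifier(
    training,
    numClasses = 2,
    categoricalFeaturesInfo = categoricalFeaturesInfo,
    numTrees = 100,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 5,
    maxBins = 32) // maxBins must be >= the largest number of levels
}
```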
You need to make sure your levels are mapped to 0, 1, 2, ..., so if you have something like ("good", "medium", "bad"), map it to (0, 1, 2).
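A minimal way to build such a mapping (just a sketch; the variable names are mine):

```scala
// Map string levels to consecutive indices 0.0, 1.0, 2.0, ...
val levels = Seq("good", "medium", "bad")
val levelIndex: Map[String, Double] =
  levels.zipWithIndex.map { case (l, i) => l -> i.toDouble }.toMap
// levelIndex("good") == 0.0, levelIndex("medium") == 1.0, levelIndex("bad") == 2.0
```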
Now, in your case, you want to use LogisticRegressionWithLBFGS. In this case my suggestion is to actually transform the categorical columns into dummy columns: for example, a single column with 3 levels ("good", "medium", "bad") becomes 3 columns with 1/0 depending on which level is present. I do not have your data to work with, so here is sample Scala code that should work:
That you can call it like:
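A sketch of such a transformation, assuming the feature vector is an Array[Double] and the categorical columns are given as a map from column index to number of levels (the function name and signature are my assumptions):

```scala
// Expand categorical columns (already mapped to 0.0, 1.0, 2.0, ...) into
// one-hot dummy columns; numerical columns are passed through unchanged.
def createDummyColumns(features: Array[Double],
                       catInfo: Map[Int, Int]): Array[Double] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[Double]
  for (i <- features.indices) {             // each column
    catInfo.get(i) match {
      case Some(numLevels) =>
        for (level <- 0 until numLevels)    // each level -> one dummy column
          out += (if (features(i) == level.toDouble) 1.0 else 0.0)
      case None =>
        out += features(i)                  // numerical column, kept as-is
    }
  }
  out.toArray
}

// Column 1 is categorical with 2 levels, column 2 with 5 levels:
val expanded = createDummyColumns(Array(3.5, 1.0, 4.0), Map(1 -> 2, 2 -> 5))
// expanded: Array(3.5, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0)
```

You would apply this to the features of each LabeledPoint before training.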
Here I do it for a list of categorical columns (in case you have more than one in your feature list). The first for loop goes through each categorical column; the second for loop goes through each level in that column and creates as many dummy columns as there are levels.
Important: this assumes you have already mapped the levels to 0, 1, 2, ...
You can then run your LogisticRegressionWithLBFGS using this new feature set. This approach also helps with SVM.
Using VectorIndexer, you can tell the indexer the maximum number of distinct values (cardinality) a field may have in order to be considered categorical, via the setMaxCategories() method.
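A sketch with the spark.ml API (the DataFrame name and the threshold of 10 are illustrative assumptions):

```scala
import org.apache.spark.ml.feature.VectorIndexer

// data: DataFrame with a Vector column named "features"
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10) // fields with <= 10 distinct values become categorical

val indexerModel = indexer.fit(data)
val indexed = indexerModel.transform(data)
```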
The Scaladocs describe how the indexer decides which features are categorical. I find this a convenient (though coarse-grained) way to extract the categorical values, but beware if you have a low-cardinality field that you want to remain continuous (e.g. age of college students vs. origin country or US state).