I am currently using StringIndexer to convert a lot of columns into unique integers for classification with RandomForestModel. I am also using a pipeline for the ML process.
My questions are:
How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add some sort of metadata to indicate that the column is categorical? In mllib.tree.RF there was a parameter called categoricalInfo which indicated which columns are categorical. How does ml.tree.RF know which ones are, since that parameter is not present?
Also, StringIndexer maps categories to integers based on frequency of occurrence. When new data comes in, how do I make sure that this data is encoded consistently with the training data? Is it possible to do that without StringIndexing the whole data again, including the new data?
I am quite confused about how to implement this.
Yes, it is possible. You just have to use an indexer fitted on the training data. If you use ML pipelines, this will be handled for you; you can also just use the StringIndexerModel directly. One possible caveat is that it doesn't handle unseen labels gracefully, so you have to drop these before transforming.
Different ML transformers add special metadata to the transformed columns, which indicates the type of the column, the number of classes, etc.