I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel.
How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark?
Here's the sample code provided in the documentation to get us started; however, there is no mention of feature importances in it.
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Note: Use larger numTrees in practice.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
I don't see a model.featureImportances attribute available -- where can I find this?
Feature importance is now implemented in Spark 1.5. See the resolved JIRA issue. You can get a Vector of feature importances with:
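model.featureImportances

(Note that in PySpark this attribute lives on the DataFrame-based ml models, e.g. RandomForestClassificationModel, rather than on the RDD-based mllib RandomForestModel.)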
As others have said, feature importance is not implemented in MLlib.
Maybe this can be useful for you: https://github.com/wxhC3SC6OPm8M1HXboMy/spark-ml/blob/master/FeatureSelection.scala
I believe that this now works. You can call:
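A minimal sketch (the DataFrame df and its column names are assumed here, not part of the original answer):

from pyspark.ml.classification import RandomForestClassifier

# df is a DataFrame with "label" and "features" columns (ml package, not mllib)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
model = rf.fit(df)  # returns a RandomForestClassificationModel
importances = model.featureImportances  # SparseVector, one weight per feature
print(importances)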
Running fit on a RandomForestClassifier returns a RandomForestClassificationModel which has the desired featureImportances calculated. I hope that this helps : )
UPDATE for version > 2.0.0
From version 2.0.0, as you can see here, featureImportances is available for Random Forest.
In fact, you can find here that:
If you want to have Feature Importance values, you have to work with the ml package, not mllib, and use DataFrames.
Below there is an example that you can find here:
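A sketch modeled on the random forest classifier example in the Spark ML documentation (spark is an existing SparkSession; the libsvm sample path and column names follow that docs example):

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer

# Load the data as a DataFrame (the ml package works with DataFrames, not RDDs).
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index the labels and the categorical features.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing).
(trainingData, testData) = data.randomSplit([0.7, 0.3])

rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures",
                            numTrees=10)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf])
model = pipeline.fit(trainingData)

# The fitted forest is the last pipeline stage; featureImportances is a SparseVector.
rfModel = model.stages[2]
print(rfModel.featureImportances)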
I have to disappoint you, but feature importances in the MLlib implementation of RandomForest are simply not calculated, so you cannot get them from anywhere except by implementing their calculation on your own.
Here's how to find it out:
You call the function RandomForest.trainClassifier, defined here: https://github.com/apache/spark/blob/branch-1.3/python/pyspark/mllib/tree.py

It calls callMLlibFunc("trainRandomForestModel", ...), which is a call to the Scala function RandomForest.trainClassifier or RandomForest.trainRegressor (depending on the algo), which returns a RandomForestModel object.

That object is described in https://github.com/apache/spark/blob/branch-1.3/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala and extends TreeEnsembleModel, defined in the same source file. Unfortunately, this class stores only the algorithm (regression or classification), the trees themselves, the relative weights of the trees, and the combining strategy (sum, avg, vote). It does not store feature importances, and does not even calculate them (see https://github.com/apache/spark/blob/branch-1.3/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala for the training algorithm).
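If you are stuck on the mllib API, one crude workaround (my own sketch, not part of Spark) is to parse the model's toDebugString() output and count how often each feature index appears as a split. This is only a rough proxy -- it ignores node impurity and sample counts, unlike the importance the ml package later computes:

import re
from collections import Counter

def split_counts(model):
    # Count how often each feature index is used as a split across all trees
    # in the forest, by scanning lines like "If (feature 434 <= 0.0)".
    matches = re.findall(r"feature (\d+)", model.toDebugString())
    return Counter(int(m) for m in matches)

# e.g. the ten most frequently split-on features of the mllib model above
print(split_counts(model).most_common(10))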