I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a a method of RandomForestModel
.
How can I extract class probabilities from a RandomForestModel
classifier in PySpark?
Here's the sample code provided in the documentation that only provides the final class (not the probability):
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Note: Use larger numTrees in practice.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
I don't see any model.predict_proba()
method -- what should I do??
Probably people would have moved on with this post, but i was hit by the same problem today when trying to compute the accuracy for the multi-class classifier against a training set. So I thought I share my experience if someone is trying with mllib ...
probability can be computed fairly easy as follows:-
This is now available.
Spark ML provides:
predictionCol
that contains the predicted labelprobabilityCol
that contains a vector with the probabilities for each label, this is what you where looking for!For more details, here is the Spark Documentation: http://spark.apache.org/docs/latest/ml-classification-regression.html#output-columns-predictions
As far as I can tell this is not supported in the current version (1.2.1). The Python wrapper over the native Scala code (tree.py) defines only 'predict' functions which, in turn, call the respective Scala counterparts (treeEnsembleModels.scala). The latter make decisions by taking a vote among binary decisions. A much cleaner solution would have been to provide a probabilistic prediction which can be thresholded arbitrarily or used for ROC computation like in sklearn. This feature should be added for future releases!
As a workaround, I implemented predict_proba as a pure Python function (see example below). It is neither elegant nor very efficient, as it runs a loop over the set of individual decision trees in a forest. The trick - or rather a dirty hack - is to access the array of Java decision tree models and cast them into Python counterparts. After that you can compute individual model's predictions over the entire dataset and accumulate their sum in an RDD using 'zip'. Dividing by the number of trees gets the desired result. For large datasets, a loop over a small number of decision trees in a master node should be acceptable.
The code below is rather tricky due to the difficulties of integrating Python into Spark (run in Java). One should be very careful not to send any complex data to worker nodes, which results in crashes due to serialization problems. No code referring to the Spark context can be run on a worker node. Also, no code referring to any Java code can be serialized. For example, it may be tempting to use len(trees) instead of ntrees in the code below - bang! Writing such a wrapper in Java/Scala can be much more elegant, for example by running a loop over decision trees on worker nodes and hence reducing communication costs.
The test function below demonstrates that the predict_proba gives identical test error as predict used in original examples.
All-in-all, this was a nice exercise to learn Spark!
It will, however, be available with Spark 1.5.0 and the new Spark-ML API.