可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a a method of RandomForestModel.

How can I extract class probabilities from a RandomForestModel classifier in PySpark?

Here's the sample code provided in the documentation that only provides the final class (not the probability):

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))

I don't see any model.predict_proba() method -- what should I do??

回答1:

As far as I can tell this is not supported in the current version (1.2.1). The Python wrapper over the native Scala code (tree.py) defines only 'predict' functions which, in turn, call the respective Scala counterparts (treeEnsembleModels.scala). The latter make decisions by taking a vote among binary decisions. A much cleaner solution would have been to provide a probabilistic prediction which can be thresholded arbitrarily or used for ROC computation like in sklearn. This feature should be added for future releases!

As a workaround, I implemented predict_proba as a pure Python function (see example below). It is neither elegant nor very efficient, as it runs a loop over the set of individual decision trees in a forest. The trick - or rather a dirty hack - is to access the array of Java decision tree models and cast them into Python counterparts. After that you can compute individual model's predictions over the entire dataset and accumulate their sum in an RDD using 'zip'. Dividing by the number of trees gets the desired result. For large datasets, a loop over a small number of decision trees in a master node should be acceptable.

The code below is rather tricky due to the difficulties of integrating Python into Spark (run in Java). One should be very careful not to send any complex data to worker nodes, which results in crashes due to serialization problems. No code referring to the Spark context can be run on a worker node. Also, no code referring to any Java code can be serialized. For example, it may be tempting to use len(trees) instead of ntrees in the code below - bang! Writing such a wrapper in Java/Scala can be much more elegant, for example by running a loop over decision trees on worker nodes and hence reducing communication costs.

The test function below demonstrates that the predict_proba gives identical test error as predict used in original examples.

def predict_proba(rf_model, data):
   '''
   This wrapper overcomes the "binary" nature of predictions in the native
   RandomForestModel. 
   '''

    # Collect the individual decision tree models by calling the underlying
    # Java model. These are returned as JavaArray defined by py4j.
    trees = rf_model._java_model.trees()
    ntrees = rf_model.numTrees()
    scores = DecisionTreeModel(trees[0]).predict(data.map(lambda x: x.features))

    # For each decision tree, apply its prediction to the entire dataset and
    # accumulate the results using 'zip'.
    for i in range(1,ntrees):
        dtm = DecisionTreeModel(trees[i])
        scores = scores.zip(dtm.predict(data.map(lambda x: x.features)))
        scores = scores.map(lambda x: x[0] + x[1])

    # Divide the accumulated scores over the number of trees
    return scores.map(lambda x: x/ntrees)

def testError(lap):
    testErr = lap.filter(lambda (v, p): v != p).count() / float(testData.count())
    print('Test Error = ' + str(testErr))


def testClassification(trainingData, testData):

    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=50, maxDepth=30)

    # Compute test error by thresholding probabilistic predictions
    threshold = 0.5
    scores = predict_proba(model,testData)
    pred = scores.map(lambda x: 0 if x < threshold else 1)
    lab_pred = testData.map(lambda lp: lp.label).zip(pred)
    testError(lab_pred)

    # Compute test error by comparing binary predictions
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testError(labelsAndPredictions)

All-in-all, this was a nice exercise to learn Spark!

回答2:

This is now available.

Spark ML provides:

a predictionCol that contains the predicted label
and a probabilityCol that contains a vector with the probabilities for each label, this is what you where looking for!
You can also access the raw counts

For more details, here is the Spark Documentation: http://spark.apache.org/docs/latest/ml-classification-regression.html#output-columns-predictions

回答3:

It will, however, be available with Spark 1.5.0 and the new Spark-ML API.

回答4:

Probably people would have moved on with this post, but i was hit by the same problem today when trying to compute the accuracy for the multi-class classifier against a training set. So I thought I share my experience if someone is trying with mllib ...

probability can be computed fairly easy as follows:-

# say you have a testset against which you want to run your classifier
   (trainingset, testset) =data.randomSplit([0.7, 0.3])
   # I converted the spark dataset containing the test data to pandas
     ptd=testData.toPandas()

   #Now get a count of number of labels matching the predictions

   correct = ((ptd.label-1) == (predictions)).sum() 
   # here we had to change the labels from 0-9 as opposed to 1-10 since
   #labels take the values from 0 .. numClasses-1

   m=ptd.shape[0]
   print((correct/m)*100)