Predict Class Probabilities in Spark RandomForestC

I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predicted probabilities from the models, but I only see the predicted classes, not the probabilities. According to this issue link, the issue is resolved, and it leads to this GitHub pull request and this. However, it seems it was only resolved in version 1.5. I'm using AWS EMR, which provides Spark 1.4.1, and I still have no idea how to get the predicted probabilities. If anyone knows how to do it, please share your thoughts or solutions. Thanks!

Answer 1:

I have already answered a similar question before.

Unfortunately, with MLlib you can't get the per-instance probabilities for classification models in any version up to and including 1.4.1.

There are JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic, which are IN PROGRESS as I'm writing this answer. Nevertheless, the issue seems to have been on hold since November 2014:

There is currently no way to get the posterior probability of a prediction with the Naive Bayes model during prediction. This should be made available along with the label.

And here is a note from @sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:

This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.

Reference : source.
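
To illustrate the kind of "hack" described in the quote, here is a minimal sketch of how posteriors can be computed once you have the model's log priors and log conditional probabilities. It assumes a multinomial Naive Bayes model whose parameters are stored in log space (in MLlib these are exposed as pi and theta on NaiveBayesModel, but treat that mapping as an assumption and check it against your version). The function name and the toy numbers are hypothetical, and the code is plain Python, not Spark code.

```python
import math


def naive_bayes_posteriors(pi, theta, features):
    """Return P(class | features) for a multinomial Naive Bayes model.

    pi:       log class priors, one entry per class
    theta:    theta[c][j] = log P(feature j | class c)
    features: non-negative feature values (e.g. term counts)
    """
    # Unnormalized log-posterior per class:
    # log prior + sum_j x_j * log P(feature j | class)
    log_scores = [
        prior + sum(x * t for x, t in zip(features, row))
        for prior, row in zip(pi, theta)
    ]
    # Subtract the maximum before exponentiating to avoid underflow
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters (hypothetical): two classes, two features
pi = [math.log(0.5), math.log(0.5)]
theta = [
    [math.log(0.8), math.log(0.2)],
    [math.log(0.3), math.log(0.7)],
]
posteriors = naive_bayes_posteriors(pi, theta, [2.0, 1.0])
```

With a real model you would pass the values copied out of the trained model instead of the toy lists, and apply the function to each feature vector you want to score.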

This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.

Concerning AWS, there is not much you can do about that right now. One solution might be to fork the emr-bootstrap-actions for Spark and configure it for your needs; you will then be able to install Spark on AWS using the bootstrap step.

Nevertheless, this might seem a little complicated.

There are some things you might need to consider:

Update the spark/config.file to install your Spark 1.5 build. Something like:

+3  1.5.0   python  s3://support.elasticmapreduce/spark/install-spark-script.py s3://path.to.your.bucket.spark.installation/spark/1.5.0/spark-1.5.0.tgz

The file listed above must be a proper build of Spark, located in an S3 bucket that you own, for the time being.
To build Spark, I advise reading the building-spark-for-emr page in the examples section, as well as the official documentation. That should be about it! (I hope I haven't forgotten anything.)

EDIT : Amazon EMR release 4.1.0 offers an upgraded version of Apache Spark (1.5.0). You can check here for more details.

Answer 2:

Unfortunately, this isn't possible with version 1.4.1. If you can't upgrade, you could extend the random forest class and copy some of the code I added in that pull request, but be sure to switch back to the regular version once you are able to upgrade.
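
As a rough idea of what such an extension has to compute, here is a minimal sketch that turns per-tree predictions into class probabilities by taking the fraction of trees that vote for each class. This is a generic technique and not the code from the pull request; Spark's own implementation may weight or aggregate trees differently. The function name and the lambda "trees" are hypothetical stand-ins for trained decision trees, and the code is plain Python rather than Spark code.

```python
from collections import Counter


def forest_class_probabilities(trees, features, num_classes):
    """Estimate class probabilities as the share of trees voting for each class."""
    # Each tree returns a single class index for the given features
    votes = Counter(tree(features) for tree in trees)
    total = len(trees)
    return [votes.get(label, 0) / total for label in range(num_classes)]


# Hypothetical stand-ins for trained trees: feature list -> class index
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.2 else 0,
    lambda x: 0,
]
probs = forest_class_probabilities(trees, [0.9, 0.1], num_classes=2)
```

The predicted class is then simply the index with the highest probability, so the regular prediction stays consistent with the extracted probabilities.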

Answer 3:

Spark 1.5.0 is now supported natively on EMR with the emr-4.1.0 release! No more need to use the emr-bootstrap-actions, which btw only work on 3.x AMIs, not emr-4.x releases.