I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predict probabilities from the models but I only saw prediction classes instead of the probabilities. According to this issue link, the issue is resolved and it leads to this github pull request and this. However, It seems it's resolved in the version 1.5. I'm using the AWS EMR which provides Spark 1.4.1 and sill have no idea how to get the predict probabilities. If anyone knows how to do it, please share your thought or solutions. Thanks!
相关问题
- How to maintain order of key-value in DataFrame sa
- Unusual use of the new keyword
- Get Runtime Type picked by implicit evidence
- Spark on Yarn Container Failure
- What's the point of nonfinal singleton objects
相关文章
- Gatling拓展插件开发,check(bodyString.saveAs("key"))怎么实现
- Livy Server: return a dataframe as JSON?
- RDF libraries for Scala [closed]
- Why is my Dispatching on Actors scaled down in Akk
- How do you run cucumber with Scala 2.11 and sbt 0.
- GRPC: make high-throughput client in Java/Scala
- Setting up multiple test folders in a SBT project
- SQL query Frequency Distribution matrix for produc
I have already answered a similar question before.
Unfortunately, with MLLIb you can't get the probabilities per instance for classification models till version 1.4.1.
There is JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic which is IN PROGRESS as I'm writing the answer now. Nevertheless, the issue seems to be on hold since November 2014
And here is a note from @sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:
Reference : source.
This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.
Concerning AWS, there is not much you can do now for that. A solution might be if you can fork the emr-bootstrap-actions for spark and configure it for you needs, then you'll be able to install Spark on AWS using the bootstrap step.
Nevertheless, this might seem a little complicated.
There is some thing you might need to consider :
update the
spark/config.file
to install you spark-1.5. Something like :this file list above must be a proper build of spark located in an specified s3 bucket you own for the time being.
To build your spark, I advice you reading about it in the examples section about building-spark-for-emr and also the official documentation. That should be about it! (I hope I haven't forgotten anything)
EDIT : Amazon EMR release 4.1.0 offers an upgraded version of Apache Spark (1.5.0). You can check here for more details.
Unfortunately this isn't possible with version 1.4.1, you could extend the random forest class and copy some of the code I added in that pull request if you can't upgrade - but be sure to switch back to the regular version once you are able to upgrade.
Spark 1.5.0 is now supported natively on EMR with the emr-4.1.0 release! No more need to use the emr-bootstrap-actions, which btw only work on 3.x AMIs, not emr-4.x releases.