I am using Spark 1.5.1 with MLlib. I built a random forest model using MLlib and am now using the model to make predictions. I can get the predicted category (0.0 or 1.0) using the .predict function. However, I can't find a function to retrieve the probability (see the attached screenshot). I thought the Spark 1.5.1 random forest would provide the probability; am I missing anything here?
Unfortunately, this feature is not available in the older Spark MLlib 1.5.1.
You can, however, find it in the more recent Pipeline API in Spark MLlib 2.x, as RandomForestClassifier.
Note: An example is given in the official documentation of Spark MLlib's ML random forest classifier.
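A minimal sketch of what using RandomForestClassifier might look like, assuming Spark 2.x, a local master, and the sample LIBSVM file that ships with the Spark distribution (the path, app name, and tree count are assumptions, and this is not the documentation's example verbatim):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.SparkSession

object RandomForestProbabilitySketch {
  def main(args: Array[String]): Unit = {
    // Assumption: running locally; adjust the master for a real cluster.
    val spark = SparkSession.builder()
      .appName("rf-probability-sketch")
      .master("local[*]")
      .getOrCreate()

    // Assumption: path to the sample data bundled with Spark.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
    val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(10)

    val model = rf.fit(train)

    // The transformed DataFrame carries the probability vector next to the prediction.
    model.transform(test)
      .select("prediction", "rawPrediction", "probability")
      .show(5, truncate = false)

    spark.stop()
  }
}
```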
Here is an explanation of some of the output columns:
- predictionCol represents the predicted label.
- rawPredictionCol represents a Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction (available for classification only).
- probabilityCol represents the probability Vector of length # classes, equal to rawPrediction normalized to a multinomial distribution (available for classification only).

You can't directly get the classification probabilities, but it is relatively easy to calculate them yourself. A random forest is an ensemble of trees, and its output probability for a class is the number of trees voting for that class divided by the total number of trees.
Since the RandomForestModel in MLlib gives you the trained trees, it is easy to do it yourself. The following code gives the probability for the binary classification problem. Its generalization to multi-class classification is straightforward.
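A minimal, Spark-free sketch of the vote-fraction calculation for the binary case (not the original answer's code). The Tree type is a hypothetical stand-in for a trained tree; with MLlib, that role would presumably be played by the elements of model.trees, each called as tree.predict(point.features).

```scala
object ForestVoteProbability {
  // Hypothetical stand-in for a trained tree: maps features to a label (0.0 or 1.0).
  type Tree = Vector[Double] => Double

  /** Fraction of trees that vote for the positive class (label 1.0). */
  def positiveProbability(trees: Seq[Tree], features: Vector[Double]): Double = {
    require(trees.nonEmpty, "the forest must contain at least one tree")
    // Each tree predicts 0.0 or 1.0, so the sum counts the positive votes.
    trees.map(tree => tree(features)).sum / trees.size
  }

  def main(args: Array[String]): Unit = {
    // Three made-up decision stumps on a two-feature input.
    val trees: Seq[Tree] = Seq(
      f => if (f(0) > 0.5) 1.0 else 0.0,
      f => if (f(1) > 0.5) 1.0 else 0.0,
      f => if (f(0) + f(1) > 1.5) 1.0 else 0.0
    )
    println(positiveProbability(trees, Vector(0.9, 0.2)))
  }
}
```

On an RDD of points, the same expression would go inside a map over the points, with the trees captured in the closure or broadcast.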
For the multi-class case, you only need to replace the map with .map(_.predict(point.features) -> 1.0), group by key instead of summing, and finally take the max of the values.
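The multi-class variant can be sketched the same way, again without Spark and with a hypothetical Tree type standing in for MLlib's trained trees. Each tree contributes one vote keyed by its predicted label, votes are grouped by label, and the label with the largest vote share wins:

```scala
object ForestClassProbabilities {
  // Hypothetical stand-in for a trained tree: maps features to a class label.
  type Tree = Vector[Double] => Double

  /** Vote share per predicted label; labels no tree voted for are absent. */
  def classProbabilities(trees: Seq[Tree], features: Vector[Double]): Map[Double, Double] = {
    require(trees.nonEmpty, "the forest must contain at least one tree")
    trees
      .map(tree => tree(features) -> 1.0) // one vote per tree, keyed by label
      .groupBy(_._1)                      // group the votes by label
      .map { case (label, votes) => label -> (votes.map(_._2).sum / trees.size) }
  }

  /** Most-voted label together with its vote share. */
  def predictWithProbability(trees: Seq[Tree], features: Vector[Double]): (Double, Double) =
    classProbabilities(trees, features).maxBy(_._2)
}
```

Returning the whole map rather than only the maximum keeps the full distribution available, which is closer to what probabilityCol provides in the Pipeline API.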