Do you know where I can find examples of multiclass classification in Spark? I have spent a lot of time searching in books and on the web, and so far I only know that it is possible as of the latest version, according to the documentation.
Are you using Spark 1.6 rather than Spark 2.1? I think the problem is that in Spark 2.1 the transform method returns a Dataset, which can be implicitly converted to a typed RDD, whereas prior to that it returns a DataFrame or Row.
As a diagnostic, try specifying the return type of the transform function as RDD[LabeledPoint] and see if you get the same error.
ML
(Recommended in Spark 2.0+)
We'll use the same data as in the MLlib section below. There are two basic options. If an Estimator supports multiclass classification out of the box (for example, random forest), you can use it directly. If a model supports only binary classification (logistic regression) and extends o.a.s.ml.classification.Classifier, you can use a one-vs-rest strategy.
MLlib
According to the official documentation, at the moment (MLlib 1.6.0) the following methods support multiclass classification: logistic regression, decision trees, random forests, and naive Bayes.
At least some of the examples use multiclass classification.
The general framework, ignoring method-specific arguments, is pretty much the same as for all the other methods in MLlib. You have to preprocess your input to create either a data frame with columns representing label and features, or an RDD[LabeledPoint]. Spark provides a broad range of useful tools designed to facilitate this process, including Feature Extractors, Feature Transformers, and pipelines.
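As a rough sketch of the data frame route, the snippet below builds a data frame with a string category column and a features column, indexes the labels inside a pipeline, and fits both a random forest and a one-vs-rest wrapper around logistic regression. It assumes Spark 2.x or later with a local SparkSession; the object name MulticlassMlSketch, the made-up data, and the parameter values are illustrative assumptions, not code from the original answer.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical object name; a sketch, not the original author's code.
object MulticlassMlSketch {

  // Fits two multiclass models and returns their accuracy on a held-out split.
  def run(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._

    // Made-up data: three well separated classes with string labels.
    val data = (0 until 30).flatMap { i =>
      val j = i * 0.01
      Seq(
        ("foo", Vectors.dense(0.0 + j, 0.0 + j)),
        ("bar", Vectors.dense(5.0 + j, 5.0 + j)),
        ("baz", Vectors.dense(10.0 + j, 0.0 + j)))
    }.toDF("category", "features")

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Converts the string categories into a double-valued "label" column.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("label")

    // Option 1: an estimator that handles multiple classes directly.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(10)

    // Option 2: wrap a classifier in a one-vs-rest reduction.
    val ovr = new OneVsRest()
      .setClassifier(new LogisticRegression().setMaxIter(20))
      .setLabelCol("label")
      .setFeaturesCol("features")

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    Seq[(String, PipelineStage)]("random forest" -> rf, "one-vs-rest" -> ovr).map {
      case (name, classifier) =>
        // The pipeline fits the indexer on the training split only.
        val model = new Pipeline()
          .setStages(Array[PipelineStage](indexer, classifier))
          .fit(train)
        name -> evaluator.evaluate(model.transform(test))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("multiclass-ml-sketch")
      .getOrCreate()
    run(spark).foreach { case (name, acc) => println(s"$name accuracy: $acc") }
    spark.stop()
  }
}
```

Putting the indexer in the pipeline means the same fitted label mapping is reused when the model transforms new data.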
The rest of this answer walks through a rather naive example of using Random Forest.
First let's import the required packages and create dummy data. Then let's define the required transformers and process the train Dataset. Please note that indexer is "fitted" on the train data. It simply means that the categorical values used as the labels are converted to doubles. To use the classifier on new data you have to transform it first using this indexer.
Next we can train the RF model and finally test it.
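The steps above (dummy data, fitting the indexer on the train split, training, testing) can be sketched roughly as follows. This assumes Spark 2.x or later with a local SparkSession and the RDD-based MLlib API; the object name MulticlassMllibSketch, the column names x and y, the made-up data, and the hyperparameter values are all assumptions made for illustration.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name; a sketch, not the original author's code.
object MulticlassMllibSketch {

  // Turns an indexed data frame (label, x, y) into the RDD[LabeledPoint] MLlib expects.
  def toLabeledPoints(df: DataFrame): RDD[LabeledPoint] =
    df.select("label", "x", "y").rdd.map { row =>
      LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
    }

  def run(spark: SparkSession): MulticlassMetrics = {
    import spark.implicits._

    // 1. Dummy data: three classes with string labels and two numeric features.
    val data = (0 until 30).flatMap { i =>
      val j = i * 0.01
      Seq(("foo", 0.0 + j, 0.0 + j), ("bar", 5.0 + j, 5.0 + j), ("baz", 10.0 + j, 0.0 + j))
    }.toDF("category", "x", "y")
    val Array(trainData, testData) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // 2. Fit the indexer on the train split only, then reuse it for the test split.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("label")
      .fit(trainData)
    val train = toLabeledPoints(indexer.transform(trainData))
    val test = toLabeledPoints(indexer.transform(testData))

    // 3. Train a random forest; an empty map means all features are continuous.
    val numClasses = trainData.select("category").distinct().count().toInt
    val model = RandomForest.trainClassifier(
      train,
      numClasses = numClasses,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 10,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 4,
      maxBins = 32,
      seed = 42)

    // 4. Test: pair each prediction with the true label and compute metrics.
    val predictionsAndLabels = test.map(p => (model.predict(p.features), p.label))
    new MulticlassMetrics(predictionsAndLabels)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("multiclass-mllib-sketch")
      .getOrCreate()
    val metrics = run(spark)
    println(s"accuracy: ${metrics.accuracy}")
    println(metrics.confusionMatrix)
    spark.stop()
  }
}
```

The indexer is fitted once and applied to both splits, so the test labels are encoded with the same mapping the model was trained on.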