I'm writing a spark application and would like to use algorithms in MLlib. In the API doc I found two different classes for the same algorithm.
For example, there is one LogisticRegression in org.apache.spark.ml.classification also a LogisticRegressionwithSGD in org.apache.spark.mllib.classification.
The only difference I can find is that the one in org.apache.spark.ml is inherited from Estimator and was able to be used in cross validation. I was quite confused that they are placed in different packages. Is there anyone know the reason for it? Thanks!
It's JIRA ticket
And From Design Doc:
MLlib now covers a basic selection of machine learning algorithms, e.g., logistic regression, decision trees, alternating least squares, and k-means. The current set of APIs contains several design flaws that prevent us moving forward to
address practical machine learning pipelines,
make MLlib itself a scalable project.
The new set of APIs will live under org.apache.spark.ml
, and o.a.s.mllib
will be deprecated once we migrate all features to o.a.s.ml
.
The spark mllib guide says:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
and
Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
I think the doc explains it very well.