I am trying to build a simple custom Estimator in PySpark MLlib. I found here that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understand what @keyword_only does, or why I need so many setters and getters. Scikit-learn seems to have proper documentation for custom models (see here), but PySpark doesn't.
Pseudo code of an example model:
class NormalDeviation:
    def __init__(self, threshold=3):
        self.threshold = threshold

    def fit(self, x, y=None):
        self.model = {'mean': x.mean(), 'std': x.std()}

    def predict(self, x):
        return (x - self.model['mean']) > self.threshold * self.model['std']

    def decision_function(self, x):  # does MLlib support this?
        ...
Generally speaking there is no documentation because, as of Spark 1.6 / 2.0, most of the related API is not intended to be public. This should change in Spark 2.1.0 (see SPARK-7146).
The API is relatively complex because it has to follow specific conventions in order to make a given Transformer or Estimator compatible with the Pipeline API. Some of these methods may be required for features like reading and writing or grid search. Others, like keyword_only, are just simple helpers and not strictly required.

Assuming you have defined the following mix-ins, one for the mean parameter:
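A minimal sketch of such a mix-in could look like this (the class name HasMean and the parameter name mean are just illustrative):

from pyspark.ml.param import Param, Params, TypeConverters


class HasMean(Params):

    # The Param is declared on the class; Params._dummy() is replaced by the
    # actual parent instance when params are copied to the object.
    mean = Param(Params._dummy(), "mean", "mean of the input column",
                 typeConverter=TypeConverters.toFloat)

    def __init__(self):
        super(HasMean, self).__init__()

    def setMean(self, value):
        return self._set(mean=value)

    def getMean(self):
        return self.getOrDefault(self.mean)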
one for the standard deviation parameter:
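Following the same pattern (again, the names are illustrative):

class HasStandardDeviation(Params):

    standardDeviation = Param(Params._dummy(), "standardDeviation",
                              "standard deviation of the input column",
                              typeConverter=TypeConverters.toFloat)

    def __init__(self):
        super(HasStandardDeviation, self).__init__()

    def setStandardDeviation(self, value):
        return self._set(standardDeviation=value)

    def getStandardDeviation(self):
        return self.getOrDefault(self.standardDeviation)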
and one for the threshold:
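Once more the same pattern, with an illustrative HasCenteredThreshold name (the threshold is expressed in standard deviations from the mean):

class HasCenteredThreshold(Params):

    centeredThreshold = Param(Params._dummy(), "centeredThreshold",
                              "threshold in standard deviations from the mean",
                              typeConverter=TypeConverters.toFloat)

    def __init__(self):
        super(HasCenteredThreshold, self).__init__()

    def setCenteredThreshold(self, value):
        return self._set(centeredThreshold=value)

    def getCenteredThreshold(self):
        return self.getOrDefault(self.centeredThreshold)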
you could create a basic Estimator as follows:
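One possible sketch (the class names NormalDeviation and NormalDeviationModel are illustrative; the built-in HasInputCol and HasPredictionCol mix-ins are reused):

from pyspark import keyword_only
from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasInputCol, HasPredictionCol
from pyspark.sql.functions import avg, stddev_samp


class NormalDeviation(Estimator, HasInputCol, HasPredictionCol,
                      HasCenteredThreshold):

    def _fit(self, dataset):
        c = self.getInputCol()
        # Compute the column statistics and hand them to the fitted model.
        mu, sigma = dataset.agg(avg(c), stddev_samp(c)).first()
        return NormalDeviationModel(
            inputCol=c, mean=mu, standardDeviation=sigma,
            centeredThreshold=self.getCenteredThreshold(),
            predictionCol=self.getPredictionCol())


class NormalDeviationModel(Model, HasInputCol, HasPredictionCol,
                           HasMean, HasStandardDeviation, HasCenteredThreshold):

    @keyword_only
    def __init__(self, inputCol=None, predictionCol=None, mean=None,
                 standardDeviation=None, centeredThreshold=None):
        super(NormalDeviationModel, self).__init__()
        # In recent Spark versions keyword_only captures the passed keyword
        # arguments in self._input_kwargs, so they can be copied to the Param map.
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _transform(self, dataset):
        x = self.getInputCol()
        y = self.getPredictionCol()
        threshold = self.getCenteredThreshold()
        mu = self.getMean()
        sigma = self.getStandardDeviation()

        # Flag rows that deviate from the mean by more than threshold * sigma.
        return dataset.withColumn(y, (dataset[x] - mu) > threshold * sigma)

Note that the Estimator itself stays immutable: _fit computes the statistics and returns a new Model that carries them as Params, which is the convention the Pipeline API expects.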
Finally, it could be used as follows:
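For example, assuming an active SparkSession named spark:

from pyspark.ml import Pipeline

df = spark.createDataFrame(
    [(1, 2.0), (2, 3.0), (3, 0.0), (4, 99.0)], ["id", "x"])

normal_deviation = NormalDeviation().setInputCol("x").setCenteredThreshold(1.0)
model = Pipeline(stages=[normal_deviation]).fit(df)

# Only the outlier (x = 99.0) lies more than one standard deviation above the
# mean, so it is the only row expected to get prediction = true.
model.transform(df).show()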