What is the recommended way to distribute a scikit-learn classifier in Spark?

Posted 2020-07-24 07:13

Question:

I have built a classifier using scikit-learn and now I would like to use Spark to run predict_proba on a large dataset. I currently pickle the classifier once using:

import pickle

with open('classifier.pickle', 'wb') as f:
    pickle.dump(clf, f)

and then in my Spark job I broadcast the pickle with sc.broadcast, so that each cluster node can load the classifier.
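
Roughly, the Spark side looks like this (a simplified sketch of what I described; sc is the SparkContext and features_rdd is a placeholder for my RDD of feature vectors):

import pickle

# Read the pickled classifier once on the driver and broadcast the raw bytes.
with open('classifier.pickle', 'rb') as f:
    clf_bytes = f.read()
clf_broadcast = sc.broadcast(clf_bytes)

def predict_partition(rows):
    # Each executor unpickles the broadcast classifier once per partition.
    model = pickle.loads(clf_broadcast.value)
    return model.predict_proba(list(rows)).tolist()

probabilities = features_rdd.mapPartitions(predict_partition)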

This works, but the pickle is large (about 0.5 GB), which seems very inefficient.

Is there a better way to do this?

Answer 1:

This works, but the pickle is large (about 0.5 GB)

Note that the size of the forest will be O(M*N*Log(N)), where M is the number of trees and N is the number of samples. (source)

Is there a better way to do this?

There are several options you can try to reduce the size of either your RandomForestClassifier model or the serialized file:

  • reduce the size of the model by tuning hyperparameters, in particular max_depth, max_leaf_nodes, and min_samples_split, as these parameters influence the size of the trees in the ensemble (see the sketch after this list)

  • compress the pickle with gzip, e.g. as follows. Note there are several compression options and one may suit your case better, so you'll need to experiment:

    import gzip

    with gzip.open('classifier.pickle', 'wb') as f:
        pickle.dump(clf, f)
    
  • use joblib instead of pickle; it compresses better and is the approach recommended by scikit-learn for model persistence:

    import joblib  # older scikit-learn releases used: from sklearn.externals import joblib

    joblib.dump(clf, 'filename.pkl')
    

    The caveat here is that, depending on the joblib version and settings, the dump may be split across multiple files in a directory, so you'll have to zip these up for transport; passing compress to joblib.dump (e.g. compress=3) keeps everything in a single compressed file.

  • last but not least, you can also try reducing the size of the input via dimensionality reduction before you fit/predict with the RandomForestClassifier, as mentioned in the practical tips on decision trees (also shown in the sketch below).
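
Putting the first and last points together, a minimal sketch of what a smaller model could look like (the TruncatedSVD step and all hyperparameter values are illustrative assumptions, not tuned recommendations; X_train and y_train are placeholders for your training data):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline

    # Reduce input dimensionality first, then constrain tree growth so the
    # fitted ensemble (and hence the serialized file) stays small.
    clf = make_pipeline(
        TruncatedSVD(n_components=50),       # illustrative value
        RandomForestClassifier(
            n_estimators=100,
            max_depth=12,                    # illustrative value
            max_leaf_nodes=256,              # illustrative value
            min_samples_split=10,            # illustrative value
        ),
    )
    clf.fit(X_train, y_train)                # placeholders for your training data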

YMMV