I have a very big dataset that cannot be loaded into memory.
I want to use this dataset as the training set for a scikit-learn classifier, for example a LogisticRegression.
Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches myself?
I believe that some of the classifiers in sklearn have a partial_fit method. This method allows you to pass mini-batches of data to the classifier, so that a gradient descent step is performed for each mini-batch. You would simply load a mini-batch from disk, pass it to partial_fit, release the mini-batch from memory, and repeat.
If you are particularly interested in doing this for logistic regression, then you'll want to use SGDClassifier, which performs logistic regression when loss = 'log'.
You simply pass the features and labels for your mini-batch to partial_fit in the same way that you would use fit:
clf.partial_fit(X_minibatch, y_minibatch)
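For illustration, here is a minimal sketch of such a training loop, assuming a hypothetical load_minibatch(i) helper that reads the i-th chunk from disk. Note that the first call to partial_fit must be given the full set of class labels via the classes argument, because the model cannot know about labels it has not seen yet.

import numpy as np
from sklearn.linear_model import SGDClassifier

# loss='log_loss' gives logistic regression
# (on scikit-learn versions before 1.1 the same option is spelled loss='log')
clf = SGDClassifier(loss='log_loss')

classes = np.array([0, 1])   # every label that can occur, known up front
n_minibatches = 100          # assumed number of chunks on disk

for i in range(n_minibatches):
    # load_minibatch is a hypothetical helper, e.g. reading one chunk with
    # numpy.load, pandas.read_csv(skiprows=..., nrows=...), or HDF5 slicing
    X_minibatch, y_minibatch = load_minibatch(i)

    # classes= is required on the first call; repeating the same value later is fine
    clf.partial_fit(X_minibatch, y_minibatch, classes=classes)

# the partially fitted model is used like any other estimator:
# y_pred = clf.predict(X_new)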
Update: I recently came across the dask-ml library, which would make this task very easy by combining dask arrays with partial_fit. There is an example on the linked webpage.
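As a rough sketch of that approach, assuming dask and dask-ml are installed and the data is available as chunked dask arrays (generated randomly here just to have something runnable), the Incremental wrapper feeds each chunk to the wrapped estimator's partial_fit:

import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

# toy chunked arrays; in practice these would be backed by data on disk
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

# Incremental calls partial_fit on the wrapped estimator once per chunk
inc = Incremental(SGDClassifier(loss='log_loss'))
inc.fit(X, y, classes=[0, 1])   # extra fit kwargs are forwarded to partial_fit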
Have a look at the scaling strategies included in the sklearn documentation:
http://scikit-learn.org/stable/modules/scaling_strategies.html
A good example is provided here:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
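The out-of-core example linked above pairs a stateless feature extractor (HashingVectorizer) with a classifier that supports partial_fit, so no step ever needs the full dataset in memory. A minimal sketch of that pattern, assuming a hypothetical stream_text_batches() generator that yields (texts, labels) pairs read from disk:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so it never needs to see the whole corpus
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(loss='log_loss')

all_classes = [0, 1]   # assumed binary labels, known in advance

# stream_text_batches is a hypothetical generator yielding (texts, labels)
for texts, labels in stream_text_batches():
    X = vectorizer.transform(texts)   # sparse features for this batch only
    clf.partial_fit(X, labels, classes=all_classes)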