scikit SGDClassifier partial_fit does not learn in

Posted 2019-08-23 23:36

Question:

I pass two streams of data to an SGDClassifier (sgd_clf), as shown in the code below. The first partial_fit call takes the first stream, x1, y1. The second partial_fit call takes the second stream, x2, y2.

The code below raises an error at the second partial_fit call, saying that the class labels must be included beforehand. The error goes away when I include all my data from x2, y2 in x1, y1, so that every class label is known before the second partial_fit call.

However, I cannot provide the x2, y2 data in advance. If I have to supply all my data before the first partial_fit(), why would I need a second partial_fit() at all? In fact, if I knew all the data beforehand, I would not need partial_fit(); I could just call fit().

from sklearn import linear_model
import numpy as np

def train_new_data():

    sgd_clf = linear_model.SGDClassifier()

    x1 = [[8, 9], [20, 22]]
    y1 = [5, 6]

    classes = np.unique(y1)

    #print(classes)

    sgd_clf.partial_fit(x1,y1,classes=classes)

    x2 = [10, 12]
    y2 = 8


    sgd_clf.partial_fit([x2], [y2], classes=classes)  # Error here!!

    return sgd_clf

if __name__ == "__main__":

    print(train_new_data().predict([[20,22]]))

Q1: Is my understanding of partial_fit() for sklearn classifiers wrong? I thought it takes data on the fly, as described here: Incremental Learning

Q2: I want to retrain or update a model with new data, without training from scratch. Will partial_fit help me with this?

Q3: I am not tied to SGDClassifier. I can use any algorithm that supports online/batch learning. My main intention is Q3. I have a model trained on thousands of images. I do not want to retrain it from scratch just because I have one or two new image samples. Nor am I interested in creating a new model for each new entry and then combining them all, because searching across all the trained models slows down prediction. I just want to add the new data instances to the trained model using partial_fit. Is this feasible?

Q4: If I cannot achieve Q2 with scikit classifiers, please point me to how I can achieve it.

Any suggestions or ideas or references are much appreciated.

Answer 1:

You need to know beforehand all the classes you are going to need. After the first call to partial_fit, the algorithm assumes that no new classes will be added later.

In your example, you are adding a new class (y2 = 8) that has never been seen before and was not declared in your initial call to partial_fit (which only contained the labels 5 and 6). You need to add it to the classes argument on the first call.
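A minimal sketch of that fix, assuming the full label set (5, 6 and 8) is known before the first call. The function name train_incrementally and the random_state=0 setting are illustrative choices and do not come from the original code:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def train_incrementally():
    # Assumption: every label that can ever appear is known up front.
    all_classes = np.array([5, 6, 8])

    # random_state is fixed only to make repeated runs comparable.
    sgd_clf = SGDClassifier(random_state=0)

    # First stream: only labels 5 and 6 occur, but 8 is declared as well.
    x1 = [[8, 9], [20, 22]]
    y1 = [5, 6]
    sgd_clf.partial_fit(x1, y1, classes=all_classes)

    # Second stream: label 8 is accepted because it was declared above.
    # The classes argument may be omitted after the first call.
    x2 = [[10, 12]]
    y2 = [8]
    sgd_clf.partial_fit(x2, y2)

    return sgd_clf


clf = train_incrementally()
print(clf.classes_)
print(clf.predict([[20, 22]]))
```

With so few samples the predictions themselves are not meaningful; the point is only that the second partial_fit call no longer fails.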

I would also recommend you number your classes starting from 0 just for consistency's sake.
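If you do want labels numbered from 0, one possible approach is a plain dictionary mapping, sketched below. The names label_to_index and index_to_label are hypothetical, and scikit-learn classifiers accept arbitrary integer labels, so this step is optional:

```python
# Label values taken from the question; the mapping names are illustrative.
original_labels = [5, 6, 8]

# Map each original label to a contiguous index starting at 0, and back.
label_to_index = {label: i for i, label in enumerate(sorted(original_labels))}
index_to_label = {i: label for label, i in label_to_index.items()}

# Encode a stream of labels before passing it to partial_fit.
y1 = [5, 6]
y1_encoded = [label_to_index[label] for label in y1]
print(y1_encoded)         # [0, 1]

# Decode a predicted index back to the original label.
print(index_to_label[2])  # 8
```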