Do I use the same Tfidf vocabulary in k-fold cross-validation?

Posted 2020-08-17 17:27

Question:

I am doing text classification based on the TF-IDF vector space model. I have no more than 3,000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. But what confuses me is whether it is necessary to rebuild the TF-IDF vector space model in each fold of the cross-validation. Namely, do I need to rebuild the vocabulary and recalculate the IDF value of each term in the vocabulary in each fold?

Currently I'm doing the TF-IDF transformation with the scikit-learn toolkit and training my classifier with an SVM. My method is as follows: first, I divide the samples in hand at a ratio of 3:1; 75 percent of them are used to fit the parameters of the TF-IDF vector space model, namely the size of the vocabulary, the terms it contains, and the IDF value of each term in the vocabulary. Then I transform the remaining 25 percent with this fitted TF-IDF vectorizer and use those vectors for 5-fold cross-validation (notably, I don't use the previous 75 percent of samples in the cross-validation).

My code is as follows:

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# train/test split; the train data is used only to fit TfidfVectorizer()
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)
tfidf = TfidfVectorizer()
tfidf.fit(x_train)

# vectorize the test data for 5-fold cross-validation
x_test = tfidf.transform(x_test)

scoring = ['accuracy']
clf = SVC(kernel='linear')
scores = cross_validate(clf, x_test, y_test, scoring=scoring, cv=5, return_train_score=False)
print(scores)

My confusion is whether my way of doing the TF-IDF transformation and then 5-fold cross-validation is correct, or whether it is necessary to rebuild the TF-IDF vector space model inside each fold using only the training data, and then transform both the training and the test data with it, as follows:

from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]

    # rebuild the vocabulary and IDF values from the training fold only
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)

    clf = SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)

Answer 1:

The StratifiedKFold approach, in which you build the TfidfVectorizer() inside each fold, is the right way; by doing so you make sure that features are generated only from the training dataset.

If you build the TfidfVectorizer() on the whole dataset, you are in a situation of leaking the test dataset into the model, even though you never explicitly feed the test dataset to the classifier. Parameters such as the size of the vocabulary and the IDF value of each term in the vocabulary would differ greatly when the test documents are included.
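As a quick illustration of that difference, here is a minimal sketch (the toy documents are invented for this example) comparing a vectorizer fitted on training documents alone with one fitted on training plus test documents:

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog ran"]   # hypothetical training corpus
test_docs = ["the parrot spoke"]              # hypothetical test corpus

tfidf_train = TfidfVectorizer().fit(train_docs)
tfidf_all = TfidfVectorizer().fit(train_docs + test_docs)

# the vocabulary grows once test documents are seen during fitting
print(len(tfidf_train.vocabulary_))  # 5 terms
print(len(tfidf_all.vocabulary_))    # 7 terms ('parrot' and 'spoke' leak in)

# the IDF of shared terms also changes, since the document count changes
print(tfidf_train.idf_)
print(tfidf_all.idf_)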

A simpler way is to use a pipeline with cross_validate.

Use this!

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# the pipeline refits the vectorizer on the training folds in every CV split
clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))

scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)
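cross_validate returns a dictionary; with scoring=['accuracy'] the per-fold accuracies are stored under the 'test_accuracy' key, so you can summarize them with:

print(scores['test_accuracy'].mean())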

Note: It is not useful to run cross_validate on the test data alone. It has to be done on the [train + validation] dataset.
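If you also want a final estimate on data the model never influenced, one common pattern (a sketch, reusing the data_x and data_y from the question) is to hold out a test set up front, cross-validate the pipeline on the remaining [train + validation] portion, and score the held-out set only once at the end:

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# hold out 25% as a final test set, never used during cross-validation
x_dev, x_test, y_dev, y_test = train_test_split(
    data_x, data_y, train_size=0.75, random_state=0)

clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))

# 5-fold cross-validation on the [train + validation] portion only;
# the pipeline refits the vectorizer inside every fold
scores = cross_validate(clf, x_dev, y_dev, scoring=['accuracy'], cv=5)
print(scores['test_accuracy'].mean())

# final evaluation: refit on all development data, score once on the hold-out
clf.fit(x_dev, y_dev)
print(clf.score(x_test, y_test))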