I am doing text classification based on a TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier with 5-fold cross-validation. What confuses me is whether it is necessary to rebuild the TF-IDF vector space model in each fold of the cross-validation. Namely, do I need to rebuild the vocabulary and recalculate the IDF value of each term in the vocabulary in each fold?
Currently I'm doing the TF-IDF transformation with the scikit-learn toolkit and training my classifier with an SVM. My method is as follows: first, I split the samples at hand in a 3:1 ratio, and 75 percent of them are used to fit the parameters of the TF-IDF vector space model. Here, the parameters are the size of the vocabulary and the terms it contains, as well as the IDF value of each term in the vocabulary. Then I transform the remaining 25 percent into TF-IDF vectors with this fitted model and use these vectors for 5-fold cross-validation of the SVM (notably, the 75 percent used for fitting are not transformed or used in the cross-validation).
My code is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

# train/test split; the train part is used only to fit TfidfVectorizer()
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)
tfidf = TfidfVectorizer()
tfidf.fit(x_train)
# vectorize the held-out 25 percent for 5-fold cross-validation
x_test = tfidf.transform(x_test)
scoring = ['accuracy']
clf = SVC(kernel='linear')
scores = cross_validate(clf, x_test, y_test, scoring=scoring, cv=5, return_train_score=False)
print(scores)
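To make concrete what I mean by the fitted parameters, here is a minimal sketch on a made-up toy corpus (the documents are hypothetical and not my real data). It inspects the `vocabulary_` and `idf_` attributes of a fitted `TfidfVectorizer` and shows that terms unseen during fitting are dropped when new text is transformed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, standing in for the 75 percent used for fitting.
fit_docs = ["the cat sat", "the dog barked", "the cat purred"]

tfidf = TfidfVectorizer()
tfidf.fit(fit_docs)

# Fitted parameters: a term -> column index mapping and one IDF value per column.
vocab = tfidf.vocabulary_
idf_of_the = tfidf.idf_[vocab["the"]]
print(sorted(vocab))
print(idf_of_the)

# "bird" and "sang" were never seen during fit, so only "the" survives.
unseen = tfidf.transform(["the bird sang"])
print(unseen.nnz)
```

So the vocabulary and the IDF values depend entirely on whichever documents were passed to `fit`.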
My question is whether my way of doing the TF-IDF transformation and the 5-fold cross-validation is correct, or whether it is necessary to rebuild the TF-IDF vector space model from the training data in each fold and then transform both the training and test data of that fold into TF-IDF vectors, as follows:
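from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# data_x and data_y are assumed to be NumPy arrays so they can be indexed by index arrays
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]
    # refit the vocabulary and IDF values on the training part of this fold only
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)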
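If I understand the scikit-learn API correctly, this second approach could also be written more compactly by putting the vectorizer and the classifier into a pipeline, so that `cross_validate` refits `TfidfVectorizer` on the training part of each fold. Below is a minimal sketch of that idea; the toy corpus and labels are made up and only stand in for my `data_x` and `data_y`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 10 "sports" and 10 "tech" documents.
sports = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the goalkeeper",
    "fans cheered as the team scored",
    "the match ended in a draw",
    "the player signed for a new team",
    "the goalkeeper saved the penalty",
    "the coach changed the team lineup",
    "a late goal decided the match",
    "the player missed the penalty",
]
tech = [
    "the new laptop has a fast processor",
    "the software update fixed the bug",
    "the phone has a larger screen",
    "developers released a new software version",
    "the processor runs the code faster",
    "the bug crashed the phone app",
    "the laptop screen is very bright",
    "the update improved the app speed",
    "developers fixed the code bug",
    "the new phone app is fast",
]
docs = sports + tech
labels = [0] * len(sports) + [1] * len(tech)

# The pipeline is cloned and refit per fold, so the vocabulary and IDF
# values are learned from that fold's training documents only.
pipe = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipe, docs, labels, scoring=["accuracy"], cv=skf)
print(scores["test_accuracy"])
```

With the pipeline, the raw text is passed to `cross_validate` and no separate `fit`/`transform` bookkeeping is needed inside a loop.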