Python vectorization for classification [duplicate

This question already has an answer here:

Scikit learn - fit_transform on the test set 1 answer

I am currently trying to build a text classification model (document classification) with roughly 80 classes. When I build and train the model using random forest (after vectorizing the text into a TF-IDF matrix), the model works well. However, when I introduce new data, the same words that I used to build my RF aren't necessarily identical to the training set. This is a problem because I have a different number of features in my training set than I do in my test set (so the dimensions for the training set are less than the test).

####### Convert bag of words to TFIDF matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data)
print tfidf_matrix.shape
## number of features = 421


####### Train Random Forest Model
clf = RandomForestClassifier(max_depth=None,min_samples_split=1, random_state=1,n_jobs=-1)

####### k-fold cross validation
scores = cross_val_score(clf, tfidf_matrix.toarray(),labels,cv=7,n_jobs=-1)
print scores.mean()


### this is the new data matrix for unseen data
new_tfidf = tfidf_vectorizer.fit_transform(new_X)
### number of features = 619


clf.fit(tfidf_matrix.toarray(),labels)
clf.predict(new_tfidf.toarray())

How can I go about creating a working RF model for classification that will incorporate new features (words) that weren't seen in the training?

标签： python scikit-learn vectorization random-forest

2条回答

smile是对你的礼貌

2楼-- · 2019-05-14 21:52

Do not call fit_transform on the unseen data, only transform! That will keep the dictionary from the training set.

0人赞添加讨论(0) 举报

Anthone

3楼-- · 2019-05-14 22:03

You cannot introduce new features into the test set that were not part of your training set. The model is trained on a specific dictionary of terms and that same dictionary of terms must be used across training, validating, testing, and production. Further more, the indices of the words in your feature vector cannot change either.

You should be creating one large matrix using all of your data and then split the rows into your train and test sets. This will guarantee that you will have the same feature set for train and test.

0人赞添加讨论(0) 举报

Python vectorization for classification [duplicate

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间