I'm using the sklearn Python package, and I'm having trouble creating a CountVectorizer with a pre-built vocabulary, in such a way that the CountVectorizer doesn't drop features that appear only once or don't appear at all.
Here is the sample code that I have:
```python
train_count_vect, training_matrix, train_labels = setup_data(train_corpus, query, vocabulary=None)
test_count_vect, test_matrix, test_labels = setup_data(test_corpus, query, vocabulary=train_count_vect.get_feature_names())

print(len(train_count_vect.get_feature_names()))
print(len(test_count_vect.get_feature_names()))
```
len(train_count_vect.get_feature_names()) outputs 89967, while len(test_count_vect.get_feature_names()) outputs 9833.
Inside the setup_data() function, I am just initializing a CountVectorizer. For the training data, I initialize it without a preset vocabulary. Then, for the test data, I initialize the CountVectorizer with the vocabulary I retrieved from my training data.
How do I get the vocabularies to be the same length? I think sklearn is deleting features because they appear only once, or not at all, in my test corpus. I need the vocabularies to match, because otherwise the feature vectors my classifier was trained on will have a different length from my test data points.
So, it's impossible to say without actually seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn follows the fit/transform format, meaning there are two stages, specifically fit and transform. In the case of CountVectorizer, the fit stage effectively creates the vocabulary, and the transform step maps your input text into that vocabulary space.

My guess is that you're calling fit on both datasets instead of just one; you need to be using the same "fitted" CountVectorizer on both if you want the results to line up. It should look something like the following (a sketch, since I can't see setup_data; train_corpus and test_corpus are stand-ins for your data):
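```python
from sklearn.feature_extraction.text import CountVectorizer

# stand-in data for illustration
train_corpus = ["the cat sat", "the dog barked"]
test_corpus = ["the cat barked"]

count_vect = CountVectorizer()

# fit_transform builds the vocabulary from the training corpus
# and encodes the training documents in one step
training_matrix = count_vect.fit_transform(train_corpus)

# transform reuses the already-fitted vocabulary, so the test
# matrix has exactly the same columns as the training matrix
test_matrix = count_vect.transform(test_corpus)
```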
Again, this can only be a guess until you post the setup_data function, but having seen this before, I would guess you're doing something more like this (again a sketch, with the same stand-in names):
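```python
from sklearn.feature_extraction.text import CountVectorizer

# a separate vectorizer is fitted on each corpus, so each
# fit_transform call builds its own, different vocabulary
train_count_vect = CountVectorizer()
training_matrix = train_count_vect.fit_transform(train_corpus)

test_count_vect = CountVectorizer()
test_matrix = test_count_vect.fit_transform(test_corpus)
```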
This effectively makes a new vocabulary for the test_corpus, which unsurprisingly won't give you the same vocabulary length in both cases.