I'm using the sklearn Python package, and I'm having trouble creating a CountVectorizer with a pre-created dictionary such that the CountVectorizer doesn't delete features that appear only once or not at all.
Here is the sample code that I have:
train_count_vect, training_matrix, train_labels = setup_data(train_corpus, query, vocabulary=None)
test_count_vect, test_matrix, test_labels = setup_data(test_corpus, query, vocabulary=train_count_vect.get_feature_names())
print(len(train_count_vect.get_feature_names()))
print(len(test_count_vect.get_feature_names()))
len(train_count_vect.get_feature_names()) outputs 89967
len(test_count_vect.get_feature_names()) outputs 9833
Inside the setup_data() function, I am just initializing CountVectorizer. For training data, I'm initializing it without a preset vocabulary. Then, for test data, I'm initializing CountVectorizer with the vocabulary I retrieved from my training data.
How do I get the vocabularies to be the same length? I think sklearn is deleting features because they appear only once, or not at all, in my test corpus. I need the vocabularies to match because otherwise the feature vectors my classifier was trained on will have a different length from my test data points.
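
A minimal sketch of the setup described above, assuming setup_data() is only a thin wrapper around CountVectorizer. The helper name build_vectorizer and the toy corpora are hypothetical and stand in for the real function and data, which are not shown here. In this sketch the vocabulary is read from the fitted vocabulary_ attribute rather than get_feature_names(), and a CountVectorizer given a fixed vocabulary keeps every entry, including terms that never occur in the test corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer


def build_vectorizer(corpus, vocabulary=None):
    """Hypothetical stand-in for setup_data(): vectorize a corpus.

    If `vocabulary` is None, the vocabulary is learned from `corpus`.
    Otherwise the given vocabulary is used as-is and is not pruned.
    """
    vect = CountVectorizer(vocabulary=vocabulary)
    matrix = vect.fit_transform(corpus)
    return vect, matrix


# Toy corpora (assumed for illustration only).
train_corpus = ["the cat sat on the mat", "the dog chased the cat"]
test_corpus = ["a cat sat"]

# Learn the vocabulary from the training data.
train_vect, train_matrix = build_vectorizer(train_corpus)

# List the training terms in column order, then reuse them for the test data.
train_vocab = sorted(train_vect.vocabulary_, key=train_vect.vocabulary_.get)
test_vect, test_matrix = build_vectorizer(test_corpus, vocabulary=train_vocab)

# Both matrices have one column per training term.
print(train_matrix.shape, test_matrix.shape)
```

If the two column counts differ in the real code, one thing worth checking is whether the vocabulary argument actually reaches the CountVectorizer constructor inside setup_data().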