I'm using the sklearn Python package, and I'm having trouble creating a CountVectorizer with a pre-built vocabulary, in such a way that the CountVectorizer doesn't drop features that appear only once or don't appear at all.
Here is the sample code that I have:
```python
train_count_vect, training_matrix, train_labels = setup_data(train_corpus, query, vocabulary=None)
test_count_vect, test_matrix, test_labels = setup_data(test_corpus, query, vocabulary=train_count_vect.get_feature_names())

print(len(train_count_vect.get_feature_names()))
print(len(test_count_vect.get_feature_names()))
```
len(train_count_vect.get_feature_names()) outputs 89967, while len(test_count_vect.get_feature_names()) outputs 9833.
Inside the setup_data() function, I am just initializing a CountVectorizer. For the training data, I initialize it without a preset vocabulary. Then, for the test data, I initialize the CountVectorizer with the vocabulary I retrieved from my training data.
How do I get the vocabularies to be the same length? I think sklearn is deleting features because they appear only once, or not at all, in my test corpus. I need the vocabularies to match, because otherwise the feature vectors my classifier was trained on will have a different length from my test data points.
So, it's impossible to say without actually seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn follows the fit/transform format, meaning there are two stages, specifically fit and transform. In the case of CountVectorizer, the fit stage effectively creates the vocabulary, and the transform step maps your input text into that vocabulary space.

My guess is that you're calling fit on both datasets instead of just one; you need to be using the same "fitted" CountVectorizer on both if you want the results to line up. It should look something like the following (a sketch, since I can't see setup_data; train_corpus and test_corpus are stand-ins for your data):
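```python
from sklearn.feature_extraction.text import CountVectorizer

# stand-in data for illustration
train_corpus = ["the cat sat", "the dog barked"]
test_corpus = ["the cat barked"]

count_vect = CountVectorizer()

# fit_transform builds the vocabulary from the training corpus
# and encodes the training documents in one step
training_matrix = count_vect.fit_transform(train_corpus)

# transform reuses the already-fitted vocabulary, so the test
# matrix has exactly the same columns as the training matrix
test_matrix = count_vect.transform(test_corpus)
```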
Again, this can only be a guess until you post the setup_data function, but having seen this before, I would guess you're doing something more like this (again a sketch, with the same stand-in names):
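```python
from sklearn.feature_extraction.text import CountVectorizer

# a separate vectorizer is fitted on each corpus, so each
# fit_transform call builds its own, different vocabulary
train_count_vect = CountVectorizer()
training_matrix = train_count_vect.fit_transform(train_corpus)

test_count_vect = CountVectorizer()
test_matrix = test_count_vect.fit_transform(test_corpus)
```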
This effectively makes a new vocabulary for the test_corpus, which unsurprisingly won't give you the same vocabulary length in both cases.