Doc1: ['And that was the fallacy. Once I was free to talk with staff members']
Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']
Doc3: ['Another reality makes emotional intelligence ever more crucial']
Doc4: ['The globalization of the workforce puts a particular premium on emotional']
Doc5: ['As business changes, so do the traits needed to excel. Data tracking']
and this is a sample of my vocabulary:
my_vocabulary = ['was the fallacy', 'free to', 'stripped-down', 'ever more', 'of the workforce', 'the traits needed']
The point is that every entry in my vocabulary is a bigram or trigram. My vocabulary contains all possible bigrams and trigrams in my document set; I only gave a sample here. Because of the application, this is how my vocabulary has to be. I am trying to use CountVectorizer as follows:
from sklearn.feature_extraction.text import CountVectorizer
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]
vectorizer = CountVectorizer(vocabulary=my_vocabulary)
tf = vectorizer.fit_transform(doc_set)
I am expecting print(tf) to give me something like this:
(0, 126) 1
(0, 6804) 1
(0, 5619) 1
(0, 5019) 2
(0, 5012) 1
(0, 999) 1
(0, 996) 1
(0, 4756) 4
where the first column is the document ID, the second column is the word ID in the vocabulary, and the third column is the number of occurrences of that word in that document. But tf is empty. I know that, at the end of the day, I could write code that loops over every word in the vocabulary, counts its occurrences, and builds the matrix myself, but can I use CountVectorizer for this input and save time? Am I doing something wrong here? If CountVectorizer is not the right way to do it, any recommendation would be appreciated.
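For reference, here is a minimal, self-contained version of what I am running. The documents are shortened stand-ins for my real ones, and the vocabulary is a small sample, but the behavior is the same:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened stand-ins for my real documents (plain strings)
doc_set = [
    "And that was the fallacy. Once I was free to talk with staff members",
    "In the new, stripped-down, every-job-counts business climate",
]

# Every vocabulary entry is a bigram or trigram
my_vocabulary = ["was the fallacy", "free to", "stripped-down"]

vectorizer = CountVectorizer(vocabulary=my_vocabulary)
tf = vectorizer.fit_transform(doc_set)

# The matrix comes back with the right shape but no stored entries,
# even though the phrases clearly appear in the documents
print(tf.shape)  # (2, 3)
print(tf.nnz)    # 0
```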