How to get vocabulary word count from gensim word2

Posted 2020-03-12 06:32

Question:

I am using the gensim word2vec package in Python. I know how to get the vocabulary from a trained model, but how do I get the word count for each word in the vocabulary?

Answer 1:

Each word in the vocabulary has an associated vocabulary object, which contains an index and a count.

vocab_obj = w2v.vocab["word"]
vocab_obj.count

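Output for the Google News w2v model: 2998437

Note that when vectors are loaded from the word2vec format without a separate vocabulary file, gensim assigns placeholder counts based on each word's rank, so a value like this is probably not a true corpus frequency. Counts are real frequencies when you train the model yourself.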

So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary.

for word, vocab_obj in w2v.vocab.items():
    # Do something with vocab_obj.count, for example print it
    print(word, vocab_obj.count)
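
The `vocab` attribute used above belongs to gensim releases before 4.0. In gensim 4.x, per-word attributes are read with `get_vecattr` and the word list is `index_to_key`. The following is a minimal sketch of a helper that handles both layouts; the name `word_counts` and the `fake_wv` stand-in object are hypothetical and only there so the example runs without a trained model.

```python
from types import SimpleNamespace


def word_counts(wv):
    """Return {word: count} for a gensim KeyedVectors-like object `wv`."""
    if hasattr(wv, "get_vecattr"):
        # gensim 4.x style: per-word attributes are read with get_vecattr
        return {word: wv.get_vecattr(word, "count") for word in wv.index_to_key}
    # gensim 3.x style: wv.vocab maps each word to an object with .count
    return {word: vocab_obj.count for word, vocab_obj in wv.vocab.items()}


# Hypothetical stand-in for a 3.x-style `model.wv`, not a real gensim object.
fake_wv = SimpleNamespace(vocab={
    "the": SimpleNamespace(index=0, count=120),
    "model": SimpleNamespace(index=1, count=45),
})

counts = word_counts(fake_wv)
print(counts)
```

With a real model you would pass `model.wv` (or the loaded `KeyedVectors` object) instead of `fake_wv`.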


Answer 2:

To build a dictionary that maps each word to its count for easy lookup later, you can do the following:

w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count

If you want to sort it to see the most frequent words in the model, you can do so like this:

w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
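
The sorted dictionary keeps its order because Python dicts preserve insertion order (guaranteed since Python 3.7). If only the top few words are needed, one alternative is `collections.Counter.most_common`. This is a small sketch of that, not part of the original answer; the `w2c` values here are made-up stand-ins for counts taken from a model.

```python
from collections import Counter

# Hypothetical stand-in for the w2c dictionary built from the model.
w2c = {"model": 45, "the": 120, "gensim": 7}

# Counter.most_common(n) returns the n highest-count (word, count) pairs,
# ordered from most to least frequent.
top_two = Counter(w2c).most_common(2)
print(top_two)
```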