Given a model, e.g.
from gensim.models.word2vec import Word2Vec
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
texts = [d.lower().split() for d in documents]
w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)
It's possible to remove a word from the w2v vocabulary, e.g.
# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433 0.08862179 0.08601206 0.05281207 -0.00673626]
>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)
# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]
# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"
But when we look up the most similar words for other words after deleting 'graph', we see the word 'graph' popping up, e.g.
>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
How to remove a word completely from a Word2Vec model in gensim?
Updated
To answer @vumaasha's comment:
could you give some details as to why you want to delete a word
Let's say my universe of words is all the words in the corpus, so that the model learns the dense relations between all words.
But when I want to generate similar words, they should only come from a subset of domain-specific words.
It's possible to generate more than enough candidates from .most_similar() and then filter them, but if the space of the specific domain is small, I might be looking for a word that's ranked 1000th most similar, which is inefficient. It would be better if the unwanted words were removed from the word vectors entirely, so that .most_similar() won't return words outside of the specific domain.
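For reference, the "generate more than enough, then filter" workaround can be sketched as below. This is only an illustration of why it gets expensive: the helper name most_similar_in_domain and its parameters are hypothetical, and it assumes a callable that behaves like gensim's most_similar(word, topn=...).

```python
def most_similar_in_domain(most_similar, word, domain_words, topn=10, start=50):
    """Ask for more and more neighbours until `topn` of them are in the domain.

    `most_similar` is any callable with the shape most_similar(word, topn=N)
    that returns a ranked list of (word, similarity) pairs.
    """
    domain = set(domain_words)
    request = max(start, topn)
    while True:
        candidates = most_similar(word, topn=request)
        hits = [(w, sim) for w, sim in candidates if w in domain]
        # Fewer results than requested means the whole vocabulary was scanned.
        exhausted = len(candidates) < request
        if len(hits) >= topn or exhausted:
            return hits[:topn]
        request *= 2  # widen the search and rank everything again


# Stand-in ranking so the sketch runs without a trained model.
ranked = [("w%d" % i, 1.0 - i / 100.0) for i in range(100)]


def fake_most_similar(word, topn=10):
    return ranked[:topn]


top_two = most_similar_in_domain(fake_most_similar, "x", {"w3", "w70", "w99"}, topn=2, start=4)
```

With a real model the first argument would be w2v_model.wv.most_similar. Every doubling repeats the full ranking, which is the inefficiency described above.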
I wrote a function which removes from a KeyedVectors object every word that isn't in a predefined word list.
It rewrites all of the word-related attributes of Word2VecKeyedVectors.
Usage:
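A minimal sketch of such a function with a usage example, assuming the gensim 3.x attribute names (vectors, vectors_norm, vocab, index2word, index2entity). The function name restrict_keyed_vectors is hypothetical, and a small stand-in object is used instead of a trained model so the example runs on its own.

```python
from types import SimpleNamespace

import numpy as np


def restrict_keyed_vectors(kv, allowed_words):
    """Drop, in place, every word of `kv` that is not in `allowed_words`.

    Assumes the gensim 3.x KeyedVectors attribute layout.
    """
    allowed = set(allowed_words)
    keep = [i for i, w in enumerate(kv.index2word) if w in allowed]
    new_index2word = [kv.index2word[i] for i in keep]

    # Re-number the surviving vocab entries so they point at the new rows.
    new_vocab = {}
    for new_index, w in enumerate(new_index2word):
        entry = kv.vocab[w]
        entry.index = new_index
        new_vocab[w] = entry

    kv.vectors = np.asarray(kv.vectors)[keep]
    if getattr(kv, "vectors_norm", None) is not None:
        kv.vectors_norm = np.asarray(kv.vectors_norm)[keep]
    kv.index2word = new_index2word
    if hasattr(kv, "index2entity"):
        kv.index2entity = new_index2word
    kv.vocab = new_vocab
    return kv


# Stand-in for a KeyedVectors object (hypothetical data, not a real model).
words = ["system", "graph", "trees", "user"]
kv = SimpleNamespace(
    index2word=list(words),
    vocab={w: SimpleNamespace(index=i, count=1) for i, w in enumerate(words)},
    vectors=np.arange(8, dtype=np.float32).reshape(4, 2),
    vectors_norm=None,
)
restrict_keyed_vectors(kv, {"system", "trees"})
```

With a real model the call would be restrict_keyed_vectors(w2v_model.wv, domain_words). Building the index list in one pass and slicing the matrix once avoids appending vectors row by row.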
There is no direct way to do what you are looking for. However, you are not completely lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors (check the link). You can take a look at this method and modify it to suit your needs.
The lines shown below perform the actual logic of computing the similar words; you need to replace the variable limited with the vectors corresponding to the words of your interest. Then you are done.
Update:
If you see this line, it means that if restrict_vocab is used, the search is restricted to the top n words in the vocab, which is meaningful only if you have sorted the vocab by frequency. If you are not passing restrict_vocab, self.vectors_norm is what goes into limited.
The method most_similar calls another method, init_sims. This initializes the value of self.vectors_norm, as shown below.
So you can pick the words that you are interested in, prepare their normalized vectors, and use them in place of limited. This should work.
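The idea can be sketched outside of gensim as follows. This is not gensim's own code: the function most_similar_restricted is hypothetical, it takes the raw vector matrix and the word list as plain arguments, and it assumes no vector is all zeros.

```python
import numpy as np


def most_similar_restricted(vectors, index2word, word, allowed_words, topn=10):
    """Rank only `allowed_words` by cosine similarity to `word`."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # Same idea as init_sims: L2-normalise every row once.
    vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    word2index = {w: i for i, w in enumerate(index2word)}
    query = vectors_norm[word2index[word]]

    # The rows of the words of interest play the role of `limited`.
    allowed = set(allowed_words)
    keep = [i for i, w in enumerate(index2word) if w in allowed and w != word]
    limited = vectors_norm[keep]

    dists = limited.dot(query)           # cosine similarities
    order = np.argsort(-dists)[:topn]    # best matches first
    return [(index2word[keep[j]], float(dists[j])) for j in order]


# Toy vectors (hypothetical data) so the sketch runs on its own.
toy_words = ["a", "b", "c", "d"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
result = most_similar_restricted(toy_vectors, toy_words, "a", {"b", "c"})
```

With a gensim 3.x model, the first two arguments would be w2v_model.wv.vectors and w2v_model.wv.index2word. Only the allowed rows are multiplied and sorted, so the cost scales with the size of the domain rather than the full vocabulary.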
Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.
Suppose you only want to keep the top 5000 words in your model.
This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.
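A minimal sketch of that trimming, using the attribute names listed above. The helper keep_top_n is hypothetical, it assumes the words are stored in descending frequency order, and a small stand-in object replaces the real KeyedVectors so the example runs without gensim.

```python
from types import SimpleNamespace


def keep_top_n(kv, n):
    """Keep only the first `n` words of `kv`, in place.

    Assumes rows are ordered by descending frequency, so the first
    `n` rows are the `n` most frequent words and their indices stay valid.
    """
    kv.index2word = kv.index2word[:n]
    kept = set(kv.index2word)
    kv.vocab = {w: entry for w, entry in kv.vocab.items() if w in kept}
    kv.vectors = kv.vectors[:n]
    if getattr(kv, "vectors_norm", None) is not None:
        kv.vectors_norm = kv.vectors_norm[:n]
    return kv


# Stand-in for a KeyedVectors object (hypothetical data).
words = ["system", "user", "trees", "graph"]
kv = SimpleNamespace(
    index2word=list(words),
    vocab={w: SimpleNamespace(index=i) for i, w in enumerate(words)},
    vectors=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    vectors_norm=None,
    vector_size=2,
)
keep_top_n(kv, 2)
```

For the case described above the call would be keep_top_n(w2v_model.wv, 5000).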
The advantage of this is that if you write the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
Have tried and felt that the most straightforward way is as follows:
My sample code is as follows:
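A minimal sketch of what a text-file approach could look like, assuming the vectors were first exported in the plain-text word2vec format (a "vocab_size vector_size" header line followed by one "word v1 v2 ..." line per word). The function name filter_word2vec_text and the file names are hypothetical.

```python
import os
import tempfile


def filter_word2vec_text(src_path, dst_path, allowed_words):
    """Copy a text-format word2vec file, keeping only `allowed_words`."""
    allowed = set(allowed_words)
    with open(src_path, encoding="utf-8") as src:
        header = src.readline().split()   # "vocab_size vector_size"
        vector_size = int(header[1])
        # The word is everything before the first space on each line.
        kept = [line for line in src if line.split(" ", 1)[0] in allowed]
    with open(dst_path, "w", encoding="utf-8") as dst:
        # The header must state the new vocabulary size.
        dst.write("%d %d\n" % (len(kept), vector_size))
        dst.writelines(kept)
    return len(kept)


# Tiny hand-written file (hypothetical data) so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
full_path = os.path.join(tmp_dir, "full.txt")
small_path = os.path.join(tmp_dir, "small.txt")
with open(full_path, "w", encoding="utf-8") as f:
    f.write("3 2\ngraph 0.1 0.2\ntrees 0.3 0.4\nuser 0.5 0.6\n")
n_kept = filter_word2vec_text(full_path, small_path, {"trees", "user"})
```

With a real model, the input file would come from w2v_model.wv.save_word2vec_format(full_path, binary=False), and the filtered file could be reloaded with KeyedVectors.load_word2vec_format(small_path, binary=False).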
FYI. FAILED ATTEMPT
I tried out @zsozso's method (with the np.array modifications suggested by @Taegyung) and left it to run overnight for at least 12 hrs; it was still stuck at getting new words from the restricted set. This is perhaps because I have a lot of entities. But my text-file method works within an hour.
FAILED CODE