I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.
The problem is that the embedding size is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer.
Is there a way to keep only the desired word vectors (including their corresponding indices!), based on a whitelist of words?
There's no built-in feature that does exactly that, but it shouldn't require much code, and could be modeled on existing gensim code. A few possible alternative strategies:

Load the full vectors, then save them in an easy-to-parse format, such as via `.save_word2vec_format(..., binary=False)`. This format is nearly self-explanatory; write your own code to drop all lines from this file that aren't on your whitelist (being sure to update the leading line's declaration of the entry count). The existing source code for `load_word2vec_format()` & `save_word2vec_format()` may be instructive. You'll then have a subset file; see the sketch below.
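For instance, a minimal sketch of that filtering step (the file names and the whitelist are placeholders, not from the question):

```python
from gensim.models import KeyedVectors

# Hypothetical names: "big.kv" is the saved full model, and the
# whitelist below stands in for your real set of needed words.
wv = KeyedVectors.load("big.kv")
wv.save_word2vec_format("big_vectors.txt", binary=False)

whitelist = {"apple", "banana", "cherry"}

kept = []
with open("big_vectors.txt", encoding="utf8") as src:
    dims = src.readline().split()[1]   # header line: "<count> <dimensions>"
    for line in src:
        if line.split(" ", 1)[0] in whitelist:
            kept.append(line)

with open("small_vectors.txt", "w", encoding="utf8") as dst:
    dst.write(f"{len(kept)} {dims}\n")  # updated entry-count declaration
    dst.writelines(kept)
```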
Or, pretend you were going to train a new Word2Vec model, using your corpus-of-interest (with just the interesting words), but only create the model and do the `build_vocab()` step. Now you have an untrained model, with random vectors, but just the right vocabulary. Grab the model's `wv` property, a `KeyedVectors` instance with that right vocabulary. Then separately load the oversized vector set, and for each word in the right-sized `KeyedVectors`, copy over the actual vector from the larger set. Then save the right-sized subset; a sketch follows below.
Or, look at the (possibly broken since gensim 3.4) `intersect_word2vec_format()` method on `Word2Vec`. It more-or-less tries to do what's described in the second strategy above: with an in-memory model that has the vocabulary you want, merge in just the overlapping words from another word2vec-format set on disk. It'll either work, or provide the template for what you'd want to do.
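A sketch of that route, assuming a gensim 3.x `Word2Vec` model (the corpus and file name are placeholders):

```python
from gensim.models import Word2Vec

# Hypothetical corpus containing only the interesting words.
corpus = [["apple", "banana"], ["banana", "cherry"]]

model = Word2Vec(size=300, min_count=1)
model.build_vocab(corpus)

# Merge in just the overlapping words from the on-disk word2vec-format
# set; lockf=0.0 keeps the imported vectors frozen in any later training.
model.intersect_word2vec_format("big_vectors.bin", binary=True, lockf=0.0)

model.wv.save("small.kv")
```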
Thanks to this answer (I've changed the code a little bit to make it better), you can use the following code to solve your problem.
We have our smaller set of words in `restricted_word_set` (it can be either a list or a set) and `w2v` is our model, so here is the function. It rewrites all of the variables of the `Word2VecKeyedVectors` which are related to the words.
Usage:
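A sketch with a placeholder model path and word set:

```python
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("big_vectors.bin", binary=True)
restricted_word_set = {"apple", "banana", "cherry"}

print(len(w2v.vocab))             # full vocabulary size
restrict_w2v(w2v, restricted_word_set)
print(len(w2v.vocab))             # 3: only the whitelisted words remain
print(w2v.most_similar("apple"))  # queries now see only the subset
```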
It can also be used for removing some words: just whitelist everything in the vocabulary except the words you want to drop.