
How to convert gensim Word2Vec model to FastText model?

Published 2019-05-15 22:31

Question:

I have a Word2Vec model that was trained on a huge corpus. While using this model for a neural network application, I came across quite a few "out of vocabulary" words. Now I need to find word embeddings for these "out of vocabulary" words. So I did some googling and found that Facebook recently released a FastText library for this. My question is: how can I convert my existing Word2Vec model or KeyedVectors to a FastText model?

Answer 1:

FastText is able to create vectors for subword fragments by including those fragments in the initial training on the original corpus. Then, when it encounters an out-of-vocabulary ('OOV') word, it constructs a vector for that word from the fragments it recognizes. For languages with recurring word-root/prefix/suffix patterns, this results in vectors that are better than random guesses for OOV words.
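For example (a minimal sketch with a made-up toy corpus, assuming gensim 3.x, where the known-words dictionary is model.wv.vocab):

from gensim.models import FastText

# Tiny toy corpus, only to show the OOV behaviour; any tokenized sentences work
sentences = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system", "response"],
             ["graph", "trees", "interface", "system"]]
model = FastText(sentences, min_count=1)

print("interfaces" in model.wv.vocab)  # False: never seen as a full word
print(model.wv["interfaces"][:5])      # still returns a vector, assembled from subword n-grams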

However, the FastText training process does not extract these subword vectors from the final full-word vectors; they are learned alongside them. Thus there's no simple way to turn full-word vectors into a FastText model that also includes subword vectors.

There might be a workable way to approximate the same effect, for example by taking all known words containing the same subword fragment and extracting some common average/vector-component to be assigned to that fragment, or by modeling an OOV word as an average of in-vocabulary words within a short edit-distance of it. But these techniques wouldn't quite be FastText, just vaguely analogous to it, and how well they work, or could be made to work with tweaking, would be an experimental question. So it's not a matter of grabbing an off-the-shelf library.
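The first idea might be sketched roughly like this (a hypothetical, unoptimized illustration; char_ngrams and approximate_oov_vector are made-up helper names, and kv is assumed to be an already-loaded gensim 3.x KeyedVectors with a .vocab dict):

import numpy as np

# Character n-grams with FastText-style '<' and '>' padding
def char_ngrams(word, n_min=3, n_max=6):
    padded = "<" + word + ">"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

# Approximate an OOV vector as the average of all in-vocabulary words
# sharing at least one character n-gram with it (scans the whole
# vocabulary per query, so this is only suitable as an experiment)
def approximate_oov_vector(kv, oov_word):
    target = char_ngrams(oov_word)
    matches = [kv[w] for w in kv.vocab if target & char_ngrams(w)]
    if not matches:
        raise KeyError(oov_word)
    return np.mean(matches, axis=0)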

There are a couple of research papers with other OOV-bootstrapping ideas, mentioned in this blog post by Sebastian Ruder.

If you need the FastText OOV functionality, the best-grounded approach would be to train FastText vectors from scratch, on the same corpus that was used for your traditional full-word vectors.
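A minimal sketch of that retraining, assuming gensim 3.x and a hypothetical tokenized corpus file corpus.txt (one sentence per line); the hyperparameter values are placeholders you'd match to your original Word2Vec run:

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Stream the same corpus that was used for the original Word2Vec training
sentences = LineSentence("corpus.txt")

# Match size/window/min_count to the original Word2Vec settings so the
# resulting vector space is comparable (the values here are placeholders)
ft_model = FastText(sentences, size=300, window=5, min_count=5, workers=4)
ft_model.save("fasttext_same_corpus.model")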



Answer 2:

Here is a code snippet that exports the existing vectors to plain-text word2vec format:

from gensim.models import KeyedVectors

# Load the stored KeyedVectors and re-save them as plain text
kv_model = KeyedVectors.load(model_name)
kv_model.save_word2vec_format('{}.txt'.format(model_name), binary=False)

where model_name is the filename under which the trained Word2Vec model's KeyedVectors were saved.
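As a quick sanity check (a minimal sketch that just reloads the text file written above, assuming gensim 3.x):

from gensim.models import KeyedVectors

# Re-load the exported plain-text vectors to confirm the round trip worked
reloaded = KeyedVectors.load_word2vec_format('{}.txt'.format(model_name), binary=False)
print(len(reloaded.vocab))  # number of words exported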

However, gensim (since 3.2.0) includes its own FastText implementation, so you can train a FastText model directly:

from gensim.models import FastText

# `sentences` is the tokenized training corpus; `num_workers` is the
# number of worker threads to use during training
model = FastText(sentences, workers=num_workers)
model.wv.save_word2vec_format('{}.txt'.format(model_name), binary=False)

But you'd still need to save it as a text file, because FastText cannot interpret binary word-embedding files.