I am trying to do the following Kaggle assignment. I am using the gensim package to use word2vec. I am able to create the model and store it to disk, but when I try to load the file back I get the error below.
-HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
Traceback (most recent call last):
  File "prog_w2v.py", line 7, in <module>
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
    header = utils.to_unicode(fin.readline())
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
I found a similar question, but I was unable to solve the problem. My prog_w2v.py is below.
import gensim
import time
start = time.time()
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
end = time.time()
print end-start," seconds"
I am generating the model using the code here. The program takes about half an hour to generate the model, so I cannot run it many times to debug it.
If you save your model with save() and then load it with the load_word2vec_format method, you will get this error. To make it work, load it with load() instead.
The same thing happens when you save the model with save_word2vec_format() and then try to load it with the KeyedVectors.load method. In this situation, use KeyedVectors.load_word2vec_format() instead.
As the other answers say, knowing how you saved the file is important, because each way of saving has its own way of loading. But you can also simply pass the flag unicode_errors='ignore' to skip this issue and load the model as you want.
By default, this flag is set to 'strict': unicode_errors='strict'.
According to the documentation, errors like this can occur when the source file contains word tokens truncated in the middle of a multibyte unicode character, as is common with files from the original word2vec.c tool.
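To see what the different unicode_errors values mean at the decoding level, here is a minimal sketch using plain bytes.decode, with no gensim involved. The helper decode_token and the sample bytes are made up for illustration, and it assumes the flag is simply passed through as the errors argument when each word is decoded.

```python
def decode_token(raw, unicode_errors="strict"):
    """Hypothetical helper: decode one word token the way the flag suggests."""
    return raw.decode("utf8", errors=unicode_errors)


# "café" encoded as UTF-8, with the last byte of the "é" cut off.
truncated = b"caf\xc3"

try:
    decode_token(truncated, "strict")
except UnicodeDecodeError as exc:
    # 'strict' refuses to decode the incomplete character.
    print("strict failed:", exc.reason)

# 'ignore' drops the bytes it cannot decode.
print("ignore:", decode_token(truncated, "ignore"))

# 'replace' substitutes U+FFFD for the bytes it cannot decode.
print("replace:", repr(decode_token(truncated, "replace")))
```

This only shows how undecodable bytes in a word are handled. It does not change which file format the loader expects.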
All of the above answers are helpful if we can really keep track of how each model was saved. But what if we have a bunch of models to load and want a general method for loading them? We can use the above flag to do so.
I have run into this myself: I trained multiple models using the original word2vec.c file, but when I tried to load them into gensim, some models loaded successfully and others gave unicode errors. I have found the above flag helpful and convenient.
You are not loading the file correctly. You should use load() instead of load_word2vec_format(). The latter is for a model trained with the C code and saved in its binary format. However, you are training the model in Python and not saving it in that binary format. So you can simply load the model with load() and it should work.
If you saved your model with save(), you must use load().
load_word2vec_format is for models in the format produced by Google's original word2vec tool, not for models saved by gensim with save().
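If you are not sure which of the two formats a file is in, you can peek at its first line before choosing a loader. This is only a heuristic sketch: guess_model_format is a made-up helper, and it relies on two facts. gensim's save() writes a pickle, and pickles of protocol 2 or later start with byte 0x80, which is exactly the byte the traceback in the question complains about at position 0. The word2vec format instead starts with a header line holding the vocabulary size and the vector size.

```python
import pickle
import tempfile


def guess_model_format(path):
    """Hypothetical helper: guess how a model file was written (heuristic)."""
    with open(path, "rb") as fin:
        first_line = fin.readline()
    # Pickle protocol 2 and later begin with the PROTO opcode, byte 0x80.
    if first_line.startswith(b"\x80"):
        return "pickle"
    # word2vec format begins with an ASCII header: "<vocab_size> <vector_size>".
    parts = first_line.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "word2vec"
    return "unknown"


# Stand-in files for the demo: a small pickle and a one-word word2vec text file.
with tempfile.NamedTemporaryFile(suffix=".model", delete=False) as f:
    pickle.dump({"toy": [0.1, 0.2]}, f, protocol=2)
    pickle_path = f.name

with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(b"1 2\nhello 0.1 0.2\n")
    w2v_path = f.name

print(guess_model_format(pickle_path))  # pickle
print(guess_model_format(w2v_path))     # word2vec
```

A result of "pickle" suggests trying load(), and "word2vec" suggests load_word2vec_format(). Anything else needs a closer look.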