I have a gensim Word2Vec model that was trained and saved in Python 2 like this:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
model = Word2Vec(LineSentence('enwiki.txt'), size=100,
                 window=5, min_count=5, workers=15)
model.save('w2v.model')
However, I need to use it in Python 3. If I try to load it,
import gensim
from gensim.models import Word2Vec
model = Word2Vec.load('w2v.model')
it results in an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf9 in position 0: ordinal not in range(128)
I suppose the problem comes from the different handling of string encodings in Python 2 and Python 3. It also seems that gensim uses pickle to save and load models.
Is there a way to set encoding/pickle options so that the model loads properly? Or maybe use some external tool to convert the model file?
Recomputing it in Python 3 is not an option: it takes way too much time.
This indeed looks like a bug somewhere, as noted by memoselyk, and it can be fixed in the way described in a comment to this answer.
So you have to add encoding='latin1' to the call to _pickle.loads in gensim.utils.unpickle, load the model in Python 3, and then save it again. After that you can revert the fix and load the newly saved model in unmodified gensim with Python 3.
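To see why encoding='latin1' helps, here is a minimal sketch that uses only the standard pickle module, not gensim itself. The bytes in PY2_PICKLE are a hand-written protocol-2 pickle of a Python 2 byte string starting with 0xf9 (an assumption chosen to mimic the error above), and the helper names load_py2_pickle and default_load_fails are made up for illustration. Latin-1 maps every byte value to a code point, so decoding cannot fail, and the Python documentation names it as the encoding to use for NumPy arrays pickled by Python 2.

```python
import pickle

# Hand-written protocol-2 pickle of a Python 2 byte string (`str`)
# whose first byte is 0xf9. Illustrative stand-in for a real model file.
PY2_PICKLE = b'\x80\x02U\x04\xf9abcq\x00.'


def load_py2_pickle(data):
    """Unpickle Python 2 data in Python 3, decoding old `str` objects as Latin-1."""
    return pickle.loads(data, encoding='latin1')


def default_load_fails(data):
    """Return True if a plain pickle.loads (default encoding 'ASCII') fails to decode."""
    try:
        pickle.loads(data)
    except UnicodeDecodeError:
        return True
    return False


# Load once with the Latin-1 workaround ...
value = load_py2_pickle(PY2_PICKLE)

# ... then re-save from Python 3. The new pickle stores a real unicode
# string, so it loads later without any special encoding argument.
resaved = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
reloaded = pickle.loads(resaved)
```

This mirrors the steps above: load once with the workaround, save again from Python 3, and afterwards the plain loader is enough.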