UnicodeDecodeError when loading word2vec

Published 2019-08-06 19:35

Question:

Full Description

I am starting to work with word embeddings and have found a great amount of information about them. I understand, so far, that I can train my own word vectors or use previously trained ones, such as Google's or Wikipedia's. Those, however, are available only for English and are not useful to me, since I am working with texts in Brazilian Portuguese. I therefore went hunting for pre-trained word vectors in Portuguese and ended up finding Hirosan's List of Pretrained Word Embeddings, which led me to Kyubyong's WordVectors, from which I learned about Rami Al-Rfou's Polyglot. After downloading both, I have been unsuccessfully trying to simply load the word vectors.

Short Description

I can't load pre-trained word vectors; I am trying WordVectors and Polyglot.

Downloads

  • Kyubyong's pre-trained word2vec-format word vectors for Portuguese;
  • Polyglot's pre-trained word vectors for Portuguese;

Loading attempts

Kyubyong's WordVectors

First attempt: using Gensim, as suggested by Hirosan:

from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)

And the error returned:

[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The zip downloaded also contains other files but all of them return similar errors.
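That leading byte 0x80 is itself a clue: load_word2vec_format expects the file to begin with an ASCII header line ("<vocab_size> <dim>"), whereas files written with gensim's own save() start with the pickle protocol marker b'\x80'. A quick peek at the first byte can tell the two formats apart. This is a sketch; the helper name is mine, not gensim's:

```python
def sniff_word2vec_file(path):
    """Guess whether a .bin file is C word2vec format or a gensim save()."""
    with open(path, 'rb') as f:
        head = f.read(1)
    if head == b'\x80':
        # Pickle protocol marker: a gensim-native save -> use Word2Vec.load()
        return 'gensim native'
    if head.isdigit():
        # ASCII vocab-size header: the C tool's format -> load_word2vec_format()
        return 'word2vec C format'
    return 'unknown'
```

Running this on pt.bin would presumably report 'gensim native', which matches the solution further down.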

Polyglot

First attempt: following Al-Rfou's instructions:

import pickle
import numpy
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(pol_path, 'rb'))

And the error returned:

File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
    words, embeddings = pickle.load(open(polyglot_path, "rb"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)
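The 'ascii' codec in that traceback is telling: pickles written by Python 2 store plain str objects as raw bytes, and Python 3's unpickler decodes those as ASCII by default, so any byte >= 0x80 blows up. A minimal sketch of the mechanism, assuming the Polyglot file is such a Python-2-era pickle (the bytes below are fabricated to mimic one, not taken from the real file):

```python
import pickle

# A hand-built protocol-0 pickle holding a Python 2 str with the
# non-ASCII byte 0xd4 in it. Python 3 decodes old str objects as ASCII
# by default, which raises the same UnicodeDecodeError seen above.
py2_pickle = b"S'caf\xd4'\n."

try:
    pickle.loads(py2_pickle)  # default encoding='ascii'
except UnicodeDecodeError as exc:
    print('default load fails:', exc)

# encoding='latin1' maps every byte straight to a code point, so the
# load succeeds; encoding='bytes' would hand back raw bytes instead.
print(pickle.loads(py2_pickle, encoding='latin1'))
```

If that is indeed the cause, passing encoding='latin1' (or encoding='bytes') to pickle.load when opening polyglot-pt.pkl may get past the error.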

Second attempt: using Polyglot's word embedding load function:

First, we have to install polyglot via pip:

pip install polyglot

Now we can import it:

from polyglot.mapping import Embedding
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(pol_path)

And the error returned:

File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Extra Information

I am using Python 3 on macOS High Sierra.

Solutions

Kyubyong's WordVectors

As pointed out by Aneesh Joshi, the correct way to load Kyubyong's model is with Word2Vec's native load function.

from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)

Even though I am more than grateful for Aneesh Joshi's solution, Polyglot seems to be a better model for working with Portuguese. Any ideas about that one?
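As for Polyglot, the 'ascii' traceback earlier points to a Python-2-era pickle, so one thing worth trying (an assumption, not a confirmed fix) is unpickling with an explicit encoding and then zipping words and vectors into a dict for lookup. The load_polyglot helper is an illustrative name of mine:

```python
import pickle

def load_polyglot(path):
    # encoding='latin1' lets Python 3 unpickle Python 2 str objects
    # byte-for-byte; encoding='bytes' would keep them as raw bytes.
    with open(path, 'rb') as f:
        words, embeddings = pickle.load(f, encoding='latin1')
    # Map each word to its embedding row for O(1) lookup.
    return dict(zip(words, embeddings))
```

If it works on polyglot-pt.pkl, load_polyglot(pol_path)['casa'] would then return the embedding for that word.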

Answer 1:

For Kyubyong's pre-trained word2vec .bin file: it may have been saved using gensim's save function.

"load the model with load(). Not load_word2vec_format (that's for the C-tool compatibility)."

i.e., model = Word2Vec.load(fname)

Let me know if that works.

Reference: Gensim mailing list