ValueError: cannot reshape array of size 3800 into

2019-07-24 03:33发布

问题:

I am trying to apply word embedding on tweets. I was trying to create a vector for each tweet by taking the average of the vectors of the words present in the tweet as follow:

def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError: # handling the case where the token is not in vocabulary

            continue
    if count != 0:
        vec /= count
    return vec

Next, when I try to Prepare word2vec feature set as follow:

wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
#the length of the vector is 200

for i in range(len(tokenized_tweet)):
    wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)

wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape

I get the following error inside the loop:

ValueError                                Traceback (most recent call last)
<ipython-input-32-72aee891e885> in <module>
      4 # wordvec_arrays.reshape(1,200)
      5 for i in range(len(tokenized_tweet)):
----> 6     wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
      7 
      8 wordvec_df = pd.DataFrame(wordvec_arrays)

<ipython-input-31-9e6501810162> in word_vector(tokens, size)
      4     for word in tokens:
      5         try:
----> 6             vec += model_w2v.wv.__getitem__(word).reshape((1, size))
      7             count += 1.
      8         except KeyError: # handling the case where the token is not in vocabulary

ValueError: cannot reshape array of size 3800 into shape (1,200)

I checked all the available posts in stackOverflow but non of them really helped me.

I tried reshaping the array and it still give me the same error.

My model is:

tokenized_tweet = df['tweet'].apply(lambda x: x.split()) # tokenizing

model_w2v = gensim.models.Word2Vec(
            tokenized_tweet,
            size=200, # desired no. of features/independent variables 
            window=5, # context window size
            min_count=2,
            sg = 1, # 1 for skip-gram model
            hs = 0,
            negative = 10, # for negative sampling
            workers= 2, # no.of cores
            seed = 34)

model_w2v.train(tokenized_tweet, total_examples= len(df['tweet']), epochs=20)

any suggestions please?

回答1:

It looks like the intent of your word_vector() method is to take a list of words, and then with respect to a given Word2Vec model, return the average of all those words' vectors (when present).

To do that, you shouldn't need to do any explicit re-shaping of vectors – or even specification of size, because that's forced by what the model already provides. You could use utility methods from numpy to simplify the code a lot. For example, the gensim n_similarity() method, as part of its comparision of two lists-of-words, already does an averaging much like what you're trying, and you can look at its source as a model:

https://github.com/RaRe-Technologies/gensim/blob/f97d0e793faa57877a2bbedc15c287835463eaa9/gensim/models/keyedvectors.py#L996

So, while I haven't tested this code, I think your word_vector() method could be essentially replaced with:

import numpy as np

def average_words_vectors(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens 
               if word in wv_model]  # avoiding KeyError
    return np.array(vectors).mean(axis=0)

(It's sometimes the case that it makes sense to work with vectors that have been normalized to unit-length - as the linked gensim code via applying gensim.matutils.unitvec() to the average. I haven't done this here, as your method hadn't taken that step – but it is something to consider.)

Separate observations about your Word2Vec training code:

  • typically words with just 1, 2, or a few occurrences don't get good vectors (due to limited number & variety of examples), but do interfere with the improvement of other more-common-word vectors. That's why the default is min_count=5. So just be aware: your surviving vectors may get better if you use a default (or even larger) value here, discarding more of the rarer words.

  • the dimensions of a "dense embedding" like word2vec-vectors aren't really "independent variables" (or standalone individually-interpretable "features") as implied by your code-comment, even though they may seem that way as separate values/slots in the data. For example, you can't pick one dimension out and conclude, "that's the foo-ness of this sample" (like 'coldness' or 'hardness' or 'positiveness' etc). Rather, any of those human-describable meanings tend to be other directions in the combined-space, not perfectly aligned with any of the individual dimensions. You can sort-of tease those out by comparing vectors, and downstream ML algorithms can make use of those complicated/entangled multi-dimensional interactions. But if you think of each dimensions as its own "feature" – in any way other than yes, it's technically a single number associated with the item – you may be prone to misinterpreting the vector-space.