I am trying to apply word embeddings to tweets. I want to create a vector for each tweet by averaging the vectors of the words present in the tweet, as follows:
def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec
Next, I try to prepare the word2vec feature set as follows:
wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
# the length of the vector is 200
for i in range(len(tokenized_tweet)):
    wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape
I get the following error inside the loop:
ValueError Traceback (most recent call last)
<ipython-input-32-72aee891e885> in <module>
4 # wordvec_arrays.reshape(1,200)
5 for i in range(len(tokenized_tweet)):
----> 6 wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
7
8 wordvec_df = pd.DataFrame(wordvec_arrays)
<ipython-input-31-9e6501810162> in word_vector(tokens, size)
4 for word in tokens:
5 try:
----> 6 vec += model_w2v.wv.__getitem__(word).reshape((1, size))
7 count += 1.
8 except KeyError: # handling the case where the token is not in vocabulary
ValueError: cannot reshape array of size 3800 into shape (1,200)
I checked all the available posts on Stack Overflow, but none of them really helped me.
I tried reshaping the array and it still gives me the same error.
My model is:
tokenized_tweet = df['tweet'].apply(lambda x: x.split()) # tokenizing
model_w2v = gensim.models.Word2Vec(
    tokenized_tweet,
    size=200,     # desired no. of features/independent variables
    window=5,     # context window size
    min_count=2,
    sg=1,         # 1 for skip-gram model
    hs=0,
    negative=10,  # for negative sampling
    workers=2,    # no. of cores
    seed=34)
model_w2v.train(tokenized_tweet, total_examples= len(df['tweet']), epochs=20)
Any suggestions, please?
It looks like the intent of your word_vector() method is to take a list of words and then, with respect to a given Word2Vec model, return the average of all those words' vectors (when present).
To do that, you shouldn't need to do any explicit reshaping of vectors, or even specify size, because that's determined by what the model already provides. You could also use utility methods from numpy to simplify the code a lot. For example, the gensim n_similarity() method, as part of its comparison of two lists of words, already does an averaging much like what you're trying, and you can look at its source as a model:
https://github.com/RaRe-Technologies/gensim/blob/f97d0e793faa57877a2bbedc15c287835463eaa9/gensim/models/keyedvectors.py#L996
So, while I haven't tested this code, I think your word_vector() method could be essentially replaced with:
import numpy as np

def average_words_vectors(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens
               if word in wv_model]  # avoiding KeyError
    return np.array(vectors).mean(axis=0)
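As a rough sketch of how this could be made a little more defensive: the variant below returns a zero vector when none of a tweet's tokens are in the vocabulary (the mean of an empty list would otherwise be NaN), and then stacks the per-tweet averages into a feature matrix. The `toy_vectors` dict is a hypothetical stand-in for `model_w2v.wv`, used only so the example runs on its own; the `size` parameter and the zero-vector fallback are my assumptions, not something gensim prescribes.

```python
import numpy as np

def average_words_vectors(tokens, wv_model, size):
    """Average the vectors of the in-vocabulary tokens.

    Falls back to a zero vector of length `size` if no token is known
    (an assumed convention, chosen so every tweet still gets a row).
    """
    vectors = [wv_model[word] for word in tokens if word in wv_model]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)

# Hypothetical stand-in for model_w2v.wv (3 dimensions instead of 200).
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}
toy_tweets = [["good", "day"], ["good", "unknownword"], ["unknownword"]]

# One row per tweet, one column per embedding dimension.
features = np.vstack(
    [average_words_vectors(tokens, toy_vectors, 3) for tokens in toy_tweets]
)
print(features.shape)
```

With the real model, the call would presumably look like average_words_vectors(tokens, model_w2v.wv, 200), where tokens is a flat list of strings. It may also be worth checking that each tokenized_tweet[i] really is such a flat list: an array of size 3800 is 19 × 200, which is what you could get if a single `word` were itself a list of 19 tokens rather than one string.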
(It sometimes makes sense to work with vectors that have been normalized to unit length, as the linked gensim code does by applying gensim.matutils.unitvec() to the average. I haven't done this here, as your method hadn't taken that step, but it is something to consider.)
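If you do want that step, here is a minimal plain-numpy sketch of what unit-length normalization means. The `unit_length` helper is a hypothetical stand-in written for illustration, not gensim's own implementation; with gensim installed you would more likely call gensim.matutils.unitvec() directly.

```python
import numpy as np

def unit_length(vec):
    """Scale vec to Euclidean norm 1; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(vec)
    if norm == 0:
        return vec
    return vec / norm

average = np.array([3.0, 4.0])     # toy averaged vector with norm 5
normalized = unit_length(average)  # same direction, norm 1
print(normalized, np.linalg.norm(normalized))
```

After normalization, the dot product of two such vectors equals their cosine similarity, which is one reason this step is often applied before comparing averaged vectors.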
Separate observations about your Word2Vec training code:
Typically, words with just 1, 2, or a few occurrences don't get good vectors (due to the limited number & variety of examples), but they do interfere with the improvement of other more-common-word vectors. That's why the default is min_count=5. So just be aware: your surviving vectors may get better if you use the default (or an even larger) value here, discarding more of the rarer words.
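To get a feel for how much vocabulary a given threshold would discard before retraining, you could count word frequencies yourself. The sketch below uses only the standard library; `surviving_vocab` and `toy_tweets` are hypothetical names for illustration, and the `>=` comparison reflects my understanding that words with a total frequency below min_count are ignored.

```python
from collections import Counter

def surviving_vocab(tokenized_texts, min_count):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for tokens in tokenized_texts for word in tokens)
    return {word for word, n in counts.items() if n >= min_count}

toy_tweets = [
    ["good", "day", "today"],
    ["good", "morning"],
    ["good", "day"],
]

# Show which words would be kept at a few candidate thresholds.
for threshold in (1, 2, 3):
    print(threshold, sorted(surviving_vocab(toy_tweets, threshold)))
```

Running the same count over your real tokenized_tweet would show how many words sit at exactly 2, 3, or 4 occurrences, which are the ones a move from min_count=2 to the default would drop.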
The dimensions of a "dense embedding" like word2vec vectors aren't really "independent variables" (or standalone, individually interpretable "features") as implied by your code comment, even though they may seem that way as separate values/slots in the data. For example, you can't pick one dimension out and conclude, "that's the foo-ness of this sample" (like 'coldness' or 'hardness' or 'positiveness' etc). Rather, any of those human-describable meanings tend to be other directions in the combined space, not perfectly aligned with any of the individual dimensions. You can sort of tease those out by comparing vectors, and downstream ML algorithms can make use of those complicated/entangled multi-dimensional interactions. But if you think of each dimension as its own "feature", in any way other than yes, it's technically a single number associated with the item, you may be prone to misinterpreting the vector space.
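A toy illustration of that last point, with invented 2-d vectors rather than anything a real model produced: a "temperature" direction estimated as the difference between two contrasting words runs diagonally through the space, so neither coordinate on its own measures it. All names and numbers here are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 2-d "embeddings", invented for illustration only.
toy = {
    "hot": np.array([2.0, 2.0]),
    "cold": np.array([-2.0, -2.0]),
    "warm": np.array([1.0, 1.5]),
}

# Candidate "temperature" direction: difference between contrasting words.
direction = toy["hot"] - toy["cold"]   # diagonal, not along either axis

axis0 = np.array([1.0, 0.0])
print(cosine(direction, axis0))        # only partly aligned with dimension 0
print(cosine(toy["warm"], direction))  # "warm" points mostly along the direction
print(cosine(toy["cold"], direction))  # "cold" points the opposite way
```

In a real 200-dimensional model the same idea applies, except that a meaningful direction is typically spread over many dimensions at once, which is why reading any single column of wordvec_df as a feature with its own meaning can mislead.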