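I want to apply word embeddings to tweets. I am trying to create a vector for each tweet by averaging the vectors of the words in that tweet, as follows: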
def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec
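To make the intent concrete, here is a minimal toy sketch of the averaging I am after. The `toy_vectors` dict and the `toy_word_vector` name are hypothetical stand-ins for the trained model and my function; they are only there to show the expected result shape, not my actual data.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table standing in for the trained model.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}

def toy_word_vector(tokens, size):
    """Average the vectors of the tokens that are in the toy vocabulary."""
    vec = np.zeros((1, size))
    count = 0
    for word in tokens:
        if word in toy_vectors:  # skip out-of-vocabulary tokens
            vec += toy_vectors[word].reshape((1, size))
            count += 1
    if count:
        vec /= count
    return vec

avg = toy_word_vector(["good", "day", "unknown"], 3)
print(avg.shape)
```

Each tweet should end up as a single row of length `size`, with unknown tokens ignored.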
Next, I try to prepare the word2vec feature set as follows:
wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
# the length of each vector is 200
for i in range(len(tokenized_tweet)):
    wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape
Inside the loop I get the following error:
ValueError                                Traceback (most recent call last)
<ipython-input-32-72aee891e885> in <module>
      4 # wordvec_arrays.reshape(1,200)
      5 for i in range(len(tokenized_tweet)):
----> 6     wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
      7
      8 wordvec_df = pd.DataFrame(wordvec_arrays)

<ipython-input-31-9e6501810162> in word_vector(tokens, size)
      4     for word in tokens:
      5         try:
----> 6             vec += model_w2v.wv.__getitem__(word).reshape((1, size))
      7             count += 1.
      8         except KeyError: # handling the case where the token is not in vocabulary

ValueError: cannot reshape array of size 3800 into shape (1,200)
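As a sanity check on the message itself: 3800 is 19 × 200, so one possibility (an assumption, not something I have confirmed) is that a single lookup returns 19 vectors of length 200 rather than one. The NumPy-only sketch below uses a hypothetical `stacked` array in place of the real lookup result and triggers the same kind of failure:

```python
import numpy as np

size = 200

# Hypothetical stand-in for a lookup result holding 19 vectors at once
# (an assumption about the cause, not taken from the real model).
stacked = np.zeros((19, size))
print(stacked.size)

message = ""
try:
    # Same reshape as in word_vector; it cannot succeed for 3800 elements.
    stacked.reshape((1, size))
except ValueError as err:
    message = str(err)
    print(message)
```

If this is what is happening, printing `type(word)` and the shape of the lookup result inside the loop should show it.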
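I went through the existing posts on Stack Overflow, but none of them really helped.
I also tried reshaping the array, and it still gives me the same error.
My model is: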
tokenized_tweet = df['tweet'].apply(lambda x: x.split())  # tokenizing
model_w2v = gensim.models.Word2Vec(
    tokenized_tweet,
    size=200,       # desired no. of features/independent variables
    window=5,       # context window size
    min_count=2,
    sg=1,           # 1 for skip-gram model
    hs=0,
    negative=10,    # for negative sampling
    workers=2,      # no. of cores
    seed=34)
model_w2v.train(tokenized_tweet, total_examples=len(df['tweet']), epochs=20)
Any suggestions?