From word vector to document vector [text2vec]

Posted 2020-07-22 18:18

Question:

I'd like to use the GloVe word embedding implemented in text2vec to perform supervised regression/classification. I read the helpful tutorial on the text2vec homepage on how to generate the word vectors. However, I'm having trouble grasping how to proceed further, namely apply or transform these word vectors and attach them to each document in such a way that each document is represented by a vector (derived from its component words' vectors I assume), to be used as input in a classifier. I've run into some quick fixes online for short documents, but my documents are rather lengthy (movie subtitles) and there doesn't seem to be any guidance on how to proceed with such documents - or at least guidance matching my comprehension level; I have experience working with n-grams, dictionaries, and topic models, but word embeddings puzzle me.

Thank you!

Answer 1:

If your goal is to classify documents, I doubt any doc2vec approach will beat bag-of-words/n-grams. If you still want to try, a common simple strategy for short documents (< 20 words) is to represent each document as a weighted sum or average of its word vectors.

You can obtain it by something like:

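library(text2vec)
# dtm: document-term matrix; word_vectors: GloVe matrix with terms as rownames
common_terms = intersect(colnames(dtm), rownames(word_vectors))
# "l1" row normalization makes the product below an average of word vectors
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight dtm above with tf-idf instead of "l1" norm
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]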
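
To make the averaging concrete, here is a minimal sketch in base R only, so it runs without text2vec. The toy counts and 2-dimensional "word vectors" are made up for illustration, and the idf formula is just one common choice, assumed here, not necessarily what text2vec's own tf-idf model computes.

```r
# Toy document-term matrix: 2 documents x 3 terms (counts are made up).
dtm <- matrix(c(2, 1, 0,
                0, 1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("good", "movie", "bad")))

# Toy 2-dimensional word vectors, one row per term (values are made up).
word_vectors <- matrix(c( 1, 0,
                          0, 1,
                         -1, 0),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(c("good", "movie", "bad"), c("d1", "d2")))

common_terms <- intersect(colnames(dtm), rownames(word_vectors))
counts <- dtm[, common_terms, drop = FALSE]
vectors <- word_vectors[common_terms, , drop = FALSE]

# L1 row normalization: each row sums to 1, so the matrix product is a
# count-weighted average of the word vectors in each document.
# (A document with no common terms would give NaN here.)
dtm_l1 <- counts / rowSums(counts)
doc_vectors_avg <- dtm_l1 %*% vectors

# tf-idf variant (assumed formula): up-weight terms that occur in few
# documents, then renormalize rows so the result is still a weighted average.
idf <- log(nrow(counts) / colSums(counts > 0)) + 1
tfidf <- sweep(dtm_l1, 2, idf, `*`)
tfidf <- tfidf / rowSums(tfidf)
doc_vectors_tfidf <- tfidf %*% vectors

print(doc_vectors_avg)
print(doc_vectors_tfidf)
```

Each row of doc_vectors_avg or doc_vectors_tfidf is one document's feature vector and can be passed to a classifier as-is. In the tf-idf variant, "movie" appears in both documents, so it contributes less to each average than it does under plain L1 weighting.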

I'm not aware of any universally established method for obtaining good document vectors for long documents.