I tried to build a simple program to test my understanding of Doc2Vec,
and it seems I still have a long way to go before I really understand it.
My understanding is that each sentence in the document is first labeled with its own tag, and Doc2Vec
then learns vectors for those tags. For example, say we have a list of lists with 3 sentences.
[["I have a pet"], ["They have a pet"], ["she has no pet"]]
We then break it into 3 sentences
["I have a pet"]
["They have a pet"]
["she has no pet"]
and use gensim's TaggedDocument (or any method you build yourself) to attach a tag to each sentence.
["I", "have", "a", "pet"] Tag= positive
["They", "have", "a", "pet"] Tag= positive
["she", "has", "no", "pet"] Tag= negative
Then we use gensim's Doc2Vec to build the model, build_vocab, and train it.
What I expected is that each sentence's tag would learn its vector based on the other sentences' tags, and the model would then output a vector for each tag, much as in Word2Vec,
except that in Word2Vec the vectors are for each word.
If I did not misunderstand it, it should be something like this:
["I have a pet"] Vectors = [-0.13150065 -0.13182896 -0.1564866 ]
["They have a pet"] Vectors = [-0.13150065 -0.13182896 -0.1564866 ]
["she has no pet"] Vectors = [ 0.14937358 -0.06767108 0.14668389]
However, when I trained my model, I only got one vector for positive and one for negative, a total of 2 instead of 3 as above. Are vectors only built per label (negative and positive), and is that why there are only 2 vectors? If so, how can we compare the first sentence with the second and third sentences? I got quite confused when I received this output.
Also, is there a way to check which tag is attached to which sentence, i.e. how can I print the tag together with its sentence? For example:
tag: positive sentence: ["They have a pet"]
My code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
file = [["I have a pet"], ["They have a pet"], ["she has no pet"]]
positiveFile = file[0:2]
negativeFile = file[2]
positive = [word.split() for sentence in positiveFile for word in sentence]
negative = [word.split() for sentence in [negativeFile] for word in sentence]
total = positive + negative
taggedPositiveFiles = [TaggedDocument(sentence, ["positive"]) for i, sentence in enumerate(positive)]
taggedNegativeFiles = [TaggedDocument(sentence, ["negative"]) for i, sentence in enumerate(negative)]
totalTagged = taggedNegativeFiles + taggedPositiveFiles
model = Doc2Vec(totalTagged, min_count=1, workers=1, vector_size=3)
model.build_vocab(totalTagged, update=True)
model.train(totalTagged, total_examples=1, epochs=1)
print(model.docvecs["negative"])
print(model.docvecs["positive"])
Current output:
[-0.13150065 -0.13182896 -0.1564866 ]
[ 0.14937358 -0.06767108 0.14668389]
Expected output:
[-0.13150065 -0.13182896 -0.1564866 ]
[-0.13150065 -0.13182896 -0.1564866 ]
[ 0.14937358 -0.06767108 0.14668389]
Where did I misunderstand it? Please assist me. Thank you so much.
You get to choose how you tag your texts. The Doc2Vec model only learns doc-vectors for the exact tags you provide.

In the original Paragraph Vectors paper upon which Doc2Vec is based (and in many published examples since), every document gets its own unique ID tag, so there's a unique doc-vector per document. You get the doc-vector for a document by querying the model for that document's unique tag.

Using categorical labels like 'positive' and 'negative', which may repeat across many examples, is possible and sometimes effective, but it is different from the original conception. If your N texts only have 2 unique tags among them (repeating across texts), then at the end of training only 2 doc-vectors will be learned.
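For example, here is a minimal sketch of the unique-ID-per-document approach, using the same toy sentences as the question (the 'sent0'-style tag names and the epochs value are illustrative choices, not requirements):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["I have a pet", "They have a pet", "she has no pet"]
# give every sentence its own unique tag: 'sent0', 'sent1', 'sent2'
corpus = [TaggedDocument(words=s.split(), tags=["sent%d" % i]) for i, s in enumerate(sentences)]

model = Doc2Vec(vector_size=3, min_count=1, workers=1, epochs=20)
model.build_vocab(corpus)  # first and only build_vocab call, no update=True
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# one learned doc-vector per unique tag, so 3 vectors here;
# this also shows how to print each tag alongside its sentence
for doc in corpus:
    print(doc.tags[0], doc.words, model.docvecs[doc.tags[0]])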
(It's also possible to give texts multiple tags, so they could have both a unique ID and some other label(s), for example tags=['id001', 'positive']. However, that's best considered an advanced/experimental technique that I would only recommend after you've had simpler approaches work well, and understand from those simpler approaches how various qualities of your setup, like parameters, corpus size & quality, etc., affect results. In particular, trying to train more uniquely-tagged doc-vectors from the same amount of data can in practice mean each doc-vector is a bit 'weaker' in its usefulness. In essence, the same source info and "signal" in the data is spread out over more learned vectors. So you'd only want to do fancy things like have multiple tags per doc if you have a lot of data, and possibly even do more training passes.)
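For illustration, such a doubly-tagged document might be constructed like this (the 'id001' ID is the hypothetical example from above):

from gensim.models.doc2vec import TaggedDocument

# a document carrying both a unique ID tag and a repeated category tag
doc = TaggedDocument(words=["They", "have", "a", "pet"], tags=["id001", "positive"])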
Other notes about your setup:

The update=True option of build_vocab() is only officially supported for Word2Vec, is an advanced feature that requires a lot of experimentation to use right, and even there should only be used on the 2nd or subsequent build_vocab() calls on a model, not the 1st (see the sketch after these notes).

Toy-sized datasets generally won't give useful or intuitive results in Word2Vec/Doc2Vec; at best they can be used to understand parameter types/legality/output-sizes (as here).

Typical training passes (epochs) for a Doc2Vec model in published results are 10-20. (If you're trying to squeeze some usefulness out of tiny datasets, using more may help a bit, but it's always better to seek larger datasets.)
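Applied to the code in the question, a more conventional training sequence might look like the sketch below, keeping the question's totalTagged list and tiny vector_size (epochs=20 is just an illustrative value in the typical published range):

model = Doc2Vec(min_count=1, workers=1, vector_size=3, epochs=20)
model.build_vocab(totalTagged)  # first (and only) build_vocab call, without update=True
model.train(totalTagged, total_examples=len(totalTagged), epochs=model.epochs)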