I tried to build a simple program to test my understanding of Doc2Vec, and it seems I still have a long way to go before I really understand it.
I understand that each sentence in the document is first given its own tag, and that Doc2Vec then learns vectors for these tags. For example, from what I understand, let's say we have a list of lists with 3 sentences.
[["I have a pet"], ["They have a pet"], ["she has no pet"]]
We then break it into 3 sentences
["I have a pet"]
["They have a pet"]
["she has no pet"]
and use gensim's TaggedDocument (or any method you build yourself) to tag each sentence with a label:
["I", "have", "a", "pet"] Tag= positive
["They", "have", "a", "pet"] Tag= positive
["she", "has", "no", "pet"] Tag= negative
Then we use gensim's Doc2Vec to build the model, build_vocab and train it.
What I expected is that each sentence's label learns its vector based on the other sentences' labels, and that we then get an output vector for each label, like in Word2Vec, except that in Word2Vec the vectors are per word.
If I did not misunderstand it, it should be something like this:
["I have a pet"] Vectors = [-0.13150065 -0.13182896 -0.1564866 ]
["They have a pet"] Vectors = [-0.13150065 -0.13182896 -0.1564866 ]
["she has no pet"] Vectors = [ 0.14937358 -0.06767108 0.14668389]
However, when I trained my model, I only got one vector for positive and one for negative, i.e. 2 in total instead of 3 as above. Are the vectors only built per label (negative and positive), and is that why there are only 2 vectors? If so, how can we compare the first sentence with the second and third sentences? I got quite confused when I received this output.
*** Is there a way to check which tag is attached to which sentence? For example, how can I print the tag together with its sentence?
Example,
tag: positive sentence: ["They have a pet"]
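I was hoping I could get that with something like this, just by looping over my own list of TaggedDocument objects from the code below (assuming .tags and .words are the right attributes, and that the mapping isn't stored anywhere else):

# totalTagged is the list of TaggedDocument objects built in my code below
for doc in totalTagged:
    print("tag:", doc.tags[0], "sentence:", doc.words)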
My code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

file = [["I have a pet"], ["They have a pet"], ["she has no pet"]]
positiveFile = file[0:2]
negativeFile = file[2]

# split each sentence into a list of word tokens
positive = [word.split() for sentence in positiveFile for word in sentence]
negative = [word.split() for sentence in [negativeFile] for word in sentence]
total = positive + negative

# tag the positive sentences with "positive" and the negative sentence with "negative"
taggedPositiveFiles = [TaggedDocument(sentence, ["positive"]) for i, sentence in enumerate(positive)]
taggedNegativeFiles = [TaggedDocument(sentence, ["negative"]) for i, sentence in enumerate(negative)]
totalTagged = taggedNegativeFiles + taggedPositiveFiles

# build the model, build the vocabulary and train it
model = Doc2Vec(totalTagged, min_count=1, workers=1, vector_size=3)
model.build_vocab(totalTagged, update=True)
model.train(totalTagged, total_examples=1, epochs=1)

print(model.docvecs["negative"])
print(model.docvecs["positive"])
Current output:
[-0.13150065 -0.13182896 -0.1564866 ]
[ 0.14937358 -0.06767108 0.14668389]
Expected output:
[-0.13150065 -0.13182896 -0.1564866 ]
[-0.13150065 -0.13182896 -0.1564866 ]
[ 0.14937358 -0.06767108 0.14668389]
What did I misunderstand? Please assist me. Thank you so much.