Doc2Vec outputs only one vector per tag instead of one per document

Posted 2019-07-19 14:55

I am trying to build a simple program to test my understanding of Doc2Vec, and it seems I still have a long way to go before I understand it.

My understanding is that each sentence in the document is first given its own label, and Doc2Vec then learns vectors for those labels. For example, say we have a list of lists with 3 sentences.

[["I have a pet"], ["They have a pet"], ["she has no pet"]]

We then break it into 3 sentences:

["I have a pet"]
["They have a pet"]
["she has no pet"]

and use gensim's TaggedDocument (or any method of your own) to give each sentence a label:

["I", "have", "a", "pet"] Tag= positive
["They", "have", "a", "pet"] Tag= positive
["she", "has", "no", "pet"] Tag= negative

Then we use gensim's Doc2Vec to build the model, build the vocabulary, and train it.

What I expected is that each sentence's label learns a vector relative to the other sentences' labels, and that we then get output vectors for each label, like in Word2Vec, except that in Word2Vec the vectors are per word.

If I have not misunderstood, it should be something like this:

["I have a pet"] Vectors = [-0.13150065 -0.13182896 -0.1564866 ]
["They have a pet"] Vectors = [-0.13150065 -0.13182896 -0.1564866 ]
["she has no pet"] Vectors = [ 0.14937358 -0.06767108  0.14668389]

However, when I trained my model, I got only one vector for positive and one for negative, a total of 2 instead of 3 as above. Are vectors only built per label (negative and positive), and is that why there are only 2 vectors? If so, how can we compare the first sentence with the second and third sentences? I got quite confused when I received this output.

*** Is there a way to check which sentence a given positive tag is attached to? For example, how can I print the tag together with its sentence?

Example,

tag: positive sentence: ["They have a pet"]

My code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

file = [["I have a pet"], ["They have a pet"], ["she has no pet"]]

positiveFile = file[0:2]
negativeFile = file[2]

positive = [word.split() for sentence in positiveFile for word in sentence]
negative = [word.split() for sentence in [negativeFile] for word in sentence]
total = positive + negative

taggedPositiveFiles = [TaggedDocument(sentence, ["positive"]) for sentence in positive]
taggedNegativeFiles = [TaggedDocument(sentence, ["negative"]) for sentence in negative]
totalTagged = taggedNegativeFiles + taggedPositiveFiles

model = Doc2Vec(totalTagged, min_count=1, workers=1, vector_size=3)
model.build_vocab(totalTagged, update=True)
model.train(totalTagged, total_examples=1, epochs=1)
print(model.docvecs["negative"])
print(model.docvecs["positive"])

Current output:

[-0.13150065 -0.13182896 -0.1564866 ]
[ 0.14937358 -0.06767108  0.14668389]

Expected output:

[-0.13150065 -0.13182896 -0.1564866 ]
[-0.13150065 -0.13182896 -0.1564866 ]
[ 0.14937358 -0.06767108  0.14668389]

Where did I misunderstand it? Please assist me. Thank you so much.

1 Answer

仙女界的扛把子 · 2019-07-19 15:30

You get to choose how you tag your texts. The Doc2Vec model only learns doc-vectors for the exact tags you provide.

In the original Paragraph Vectors paper upon which Doc2Vec is based (and many published examples since), every document gets its own unique ID tag, so there's a unique doc-vector per document. You get the doc-vector for a document by querying the model for that document's unique tag.
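
For example, a minimal sketch of that unique-ID setup on your toy data (tag names like 'sent0' are just illustrative, and epochs=20 follows the typical range noted below); this also shows how to print a tag together with its sentence:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["I have a pet", "They have a pet", "she has no pet"]

# One unique tag per document, as in the original Paragraph Vectors setup
documents = [TaggedDocument(words=s.split(), tags=["sent%d" % i])
             for i, s in enumerate(sentences)]

model = Doc2Vec(documents, vector_size=3, min_count=1, epochs=20)

# One learned doc-vector per sentence, looked up by its unique tag
for i, s in enumerate(sentences):
    print("sent%d" % i, "|", s, "|", model.docvecs["sent%d" % i])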

Using categorical labels, like 'positive' and 'negative' that may repeat across many examples, is possible and sometimes effective – but different from the original conception. If all your N texts only have among them 2 unique tags (repeating across texts), then at the end of training only 2 doc-vectors will be learned.

(It's also possible to give texts multiple tags – so they could have both a unique ID and some other label(s). For example: tags=['id001', 'positive']. However, that's best considered an advanced/experimental technique that I would only recommend after you've had simpler approaches work well, and understand from those simpler approaches how various qualities of your setup – like parameters, corpus size & quality, etc – affect results. In particular, trying to train more uniquely-tagged doc-vectors from the same amount of data can in practice mean each doc-vector is a bit 'weaker' in its usefulness. In essence, the same source info and "signal" in the data is spread out over more learned vectors. So you'd only want to do fancy things like have multiple tags per doc if you have a lot of data – and possibly even do more training passes.)
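
A rough sketch of that multi-tag variant on the same toy data ('id001' etc. are placeholder IDs):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each text gets a unique ID plus a shared sentiment label
documents = [
    TaggedDocument("I have a pet".split(),    tags=["id001", "positive"]),
    TaggedDocument("They have a pet".split(), tags=["id002", "positive"]),
    TaggedDocument("she has no pet".split(),  tags=["id003", "negative"]),
]

model = Doc2Vec(documents, vector_size=3, min_count=1, epochs=20)

print(model.docvecs["id002"])     # doc-vector for the 2nd sentence alone
print(model.docvecs["positive"])  # doc-vector shared by both positive texts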

Other notes about your setup:

  • The update=True option of build_vocab() is only officially supported for Word2Vec, is an advanced feature that requires a lot of experimentation to use right, and even there should only be used on the 2nd or subsequent calls to build_vocab() on a model, never the 1st (see the corrected sketch after these notes).

  • Toy-sized datasets generally won't give useful or intuitive results in Word2Vec/Doc2Vec – at best they can be used to understand parameter types/legality/output-sizes (as here).

  • Typical training passes (epochs) for a Doc2Vec model in published results are 10-20. (If you're trying to squeeze some usefulness out of tiny datasets, using more may help a bit, but it's always better to seek larger datasets.)
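
Putting those notes together, a sketch of a more conventional version of your training code, keeping your positive/negative tags (still toy-sized, so the resulting vectors won't mean much):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument("I have a pet".split(),    ["positive"]),
    TaggedDocument("They have a pet".split(), ["positive"]),
    TaggedDocument("she has no pet".split(),  ["negative"]),
]

model = Doc2Vec(vector_size=3, min_count=1, workers=1, epochs=20)
model.build_vocab(corpus)                # first (and only) vocab scan: no update=True
model.train(corpus,
            total_examples=len(corpus),  # the number of documents, not 1
            epochs=model.epochs)

print(model.docvecs["positive"])
print(model.docvecs["negative"])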
