How is SpaCy's similarity computed?

Posted 2019-07-10 05:07

Beginner NLP Question here:

How does the .similarity method work?

Wow, spaCy is great! Its tfidf model could be easier to preprocess, but w2v with only one line of code (token.vector)?! Awesome!

In his 10-line tutorial on spaCy, andrazhribernik shows us the .similarity method, which can be run on tokens, sents, word chunks, and docs.

After nlp = spacy.load('en') and doc = nlp(raw_text), we can run .similarity queries between tokens and chunks. However, what is being calculated behind the scenes in this .similarity method?
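For concreteness, here's a minimal usage sketch of the method I'm asking about (the model name en_core_web_md is my choice, assuming a model that ships with word vectors):

import spacy

# Assumes a model with pretrained word vectors is installed,
# e.g. via: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

doc1 = nlp(u"I like salty fries and hamburgers.")
doc2 = nlp(u"Fast food tastes very good.")

# .similarity can be called on Doc, Span, and Token objects
print(doc1.similarity(doc2))        # doc vs. doc
print(doc1[2].similarity(doc2[1]))  # token vs. token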

SpaCy already has the incredibly simple .vector, which gives the w2v vector trained with the GloVe model (how cool would a .tfidf or .fasttext method be?).

Is the .similarity method simply computing the cosine similarity between these two w2v-GloVe vectors, or is it doing something else? The specifics aren't clear in the documentation; any help appreciated!

2 Answers
再贱就再见
#2 · 2019-07-10 05:58

Found the answer; in short, it's yes:

Link to Source Code

return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

This looks like the formula for computing cosine similarity, and the vectors seem to be created with SpaCy's .vector, which the documentation says is trained from GloVe's w2v model.
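As a quick sanity check, one can reproduce that source line by hand and compare it against .similarity (a sketch; en_core_web_md is an assumed vector model, the question used 'en'):

import numpy
import spacy

nlp = spacy.load('en_core_web_md')  # assumed model with word vectors
apple, orange = nlp(u"apple orange")

# Same formula as the quoted source line: dot product divided by the
# product of the vector L2 norms, i.e. cosine similarity
manual = numpy.dot(apple.vector, orange.vector) / (apple.vector_norm * orange.vector_norm)

print(manual)
print(apple.similarity(orange))  # prints a (near) identical value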

\"骚年 ilove
3楼-- · 2019-07-10 06:00

Assuming that the method you are referring to is the token similarity one, you can find the function in the source code here. As you can see, it computes the cosine similarity between the vectors.

As it says in the tutorial:

A word embedding is a representation of a word, and by extension a whole language corpus, in a vector or other form of numerical mapping. This allows words to be treated numerically with word similarity represented as spatial difference in the dimensions of the word embedding mapping.

So vector distance can be related to word similarity.
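To see that spatial relationship in practice, here's a small sketch (again assuming a model with word vectors such as en_core_web_md): related words like "dog" and "cat" score higher than unrelated pairs like "dog" and "banana".

import spacy

nlp = spacy.load('en_core_web_md')  # assumed model with word vectors
tokens = nlp(u"dog cat banana")

# Related words sit closer together in the embedding space,
# so their cosine similarity is higher
for t1 in tokens:
    for t2 in tokens:
        print(t1.text, t2.text, t1.similarity(t2))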
