How is SpaCy's similarity computed?

Posted 2019-07-10 05:56

Question:

Beginner NLP Question here:

How does the .similarity method work?

Wow, spaCy is great! Its tf-idf model could be easier to preprocess, but w2v with only one line of code (token.vector)?! Awesome!

In his 10-line tutorial on spaCy, andrazhribernik shows us the .similarity method, which can be run on tokens, sentences, word chunks, and docs.

After nlp = spacy.load('en') and doc = nlp(raw_text), we can run .similarity queries between tokens and chunks. However, what is being calculated behind the scenes by this .similarity method?
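For concreteness, here is the kind of query I mean (a minimal sketch; en_core_web_md is my assumption for a model that actually ships with word vectors, since the bare 'en' shortcut from older tutorials may resolve to a small model without them):

import spacy

# A model that ships with real word vectors (assumed installed).
nlp = spacy.load("en_core_web_md")

doc = nlp("The cat sat on the mat. Dogs chased the ball.")
cat, dogs = doc[1], doc[7]             # Token objects

print(cat.similarity(dogs))            # token vs. token
print(doc[0:7].similarity(doc[7:12]))  # span (sentence) vs. span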

SpaCy already has the incredibly simple .vector, which returns the word vector trained with the GloVe model (how cool would a .tfidf or .fasttext method be?).
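For example, continuing from the snippet above, .vector is just a plain NumPy array you can inspect directly (the dimensionality depends on the model):

token = nlp("banana")[0]
print(type(token.vector), token.vector.shape)  # e.g. <class 'numpy.ndarray'> (300,)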

Is the .similarity method simply computing the cosine similarity between these two GloVe vectors, or is it doing something else? The specifics aren't clear in the documentation; any help is appreciated!

Answer 1:

Assuming that the method you are referring to is the token similarity one, you can find the function in the source code here. As you can see, it computes the cosine similarity between the vectors.

As it says in the tutorial:

A word embedding is a representation of a word, and by extension a whole language corpus, in a vector or other form of numerical mapping. This allows words to be treated numerically with word similarity represented as spatial difference in the dimensions of the word embedding mapping.

So the vector distance can be related to the word similarity.
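As a minimal sketch of that cosine computation (my own illustration with NumPy, not spaCy's actual implementation):

import numpy as np

def cosine_similarity(u, v):
    # Dot product of the vectors divided by the product of their norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0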



Answer 2:

Found the answer; in short, it's yes:

Link to Source Code

return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

This looks like the formula for computing cosine similarity, and the vectors appear to be created with spaCy's .vector, which the documentation says is trained from the GloVe model.
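You can sanity-check this yourself by recomputing the formula on a pair of tokens and comparing it with what .similarity returns (a sketch assuming a vector-bearing model such as en_core_web_md is installed):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
dog, cat = nlp("dog cat")  # unpack the two tokens

manual = np.dot(dog.vector, cat.vector) / (dog.vector_norm * cat.vector_norm)
print(manual, dog.similarity(cat))  # the two numbers should agree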