Beginner NLP Question here:
How does the .similarity method work?
Wow, spaCy is great! Its tf-idf model could be easier to preprocess, but getting a w2v vector with a single line of code (token.vector)?! Awesome!
In his 10-line tutorial on spaCy, andrazhribernik shows us the .similarity method, which can be run on tokens, sents, word chunks, and docs.
After nlp = spacy.load('en') and doc = nlp(raw_text), we can run .similarity queries between tokens and chunks.
However, what is being calculated behind the scenes in this .similarity method?
SpaCy already has the incredibly simple .vector, which computes the w2v vector as trained from the GloVe model (how cool would a .tfidf or .fasttext method be?).
Is the .similarity method simply computing the cosine similarity between these two w2v (GloVe) vectors, or is it doing something else? The specifics aren't clear in the documentation; any help is appreciated!
Found the answer; in short, it's yes:
Link to Source Code
This looks like the formula for computing cosine similarity, and the vectors appear to be created with SpaCy's .vector, which the documentation says is trained from GloVe's w2v model.

Assuming that the method you are referring to is the token similarity one, you can find the function in the source code here. As you can see, it computes the cosine similarity between the two vectors.
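For clarity, here is a minimal NumPy sketch of that cosine similarity computation; this is an illustration of the formula, not spaCy's actual code, and the vectors below are made-up stand-ins for real .vector output:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy stand-ins for GloVe-style word vectors (hypothetical values).
v_dog = np.array([0.2, 0.8, 0.1])
v_puppy = np.array([0.25, 0.75, 0.05])
v_car = np.array([0.9, 0.05, 0.4])

print(cosine_similarity(v_dog, v_puppy))  # close to 1: similar directions
print(cosine_similarity(v_dog, v_car))    # smaller: different directions
```

Because the result depends only on the angle between the vectors, two words used in similar contexts score near 1 regardless of vector magnitude.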
As it says in the tutorial:
So the vector distance can be related to the word similarity.
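To make the distance/similarity connection concrete: for unit-length vectors, squared Euclidean distance and cosine similarity are directly related by ||a - b||^2 = 2 - 2*cos(a, b), so ranking by distance and ranking by similarity agree. A small NumPy check (toy vectors, for illustration only):

```python
import numpy as np

def unit(v):
    # Normalize a vector to unit length.
    return v / np.linalg.norm(v)

a = unit(np.array([0.2, 0.8, 0.1]))
b = unit(np.array([0.9, 0.05, 0.4]))

cos = float(np.dot(a, b))                # cosine similarity of unit vectors
dist_sq = float(np.sum((a - b) ** 2))    # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 == 2 - 2*cos(a, b)
print(abs(dist_sq - (2.0 - 2.0 * cos)))  # ~0
```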