I would like to use the Jaccard method of the stringdist function to measure the similarity of bags of words. From what I can tell, the Jaccard method only compares character q-grams within each string, not whole words.
library(stringdist)
c <- c('cat', 'dog', 'person')
d <- c('cat', 'dog', 'ufo')
stringdist(c, d, method='jaccard', q=2)
[1] 0 0 1
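So we see here that it calculates the distance element-wise: 'cat' vs. 'cat', 'dog' vs. 'dog', and 'person' vs. 'ufo'. Identical strings get a distance of 0, and strings with no shared 2-grams get a distance of 1.
I also tried converting the words into one long text string. The following comes closer to what I need, but it still works on characters rather than words, calculating 1 - (number of shared 2-grams / number of total unique 2-grams):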
f <- 'cat dog person'
g <- 'cat dog ufo'
stringdist(f, g, method='jaccard', q=2)
[1] 0.5625
How can I get it to calculate the Jaccard similarity over words instead of characters?
You can start by tokenizing each sentence and hashing the resulting list of words, which turns each sentence into a list of integers, and then use seq_dist() to calculate the distance.
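A minimal sketch of that idea, using the two strings from the question. Instead of a real hash function it assumes a simple shared vocabulary lookup with match(), so the integer ids are just positions in that vocabulary. The variable names (words_f, vocab, ids_f, and so on) are made up for illustration. A base R set computation is included as a cross-check.

```r
library(stringdist)

f <- 'cat dog person'
g <- 'cat dog ufo'

# Tokenize: split each sentence on whitespace
words_f <- strsplit(f, "\\s+")[[1]]
words_g <- strsplit(g, "\\s+")[[1]]

# "Hash" each word to an integer id via a shared vocabulary
# (an assumption for this sketch; any consistent word -> integer mapping works)
vocab <- unique(c(words_f, words_g))
ids_f <- match(words_f, vocab)
ids_g <- match(words_g, vocab)

# Jaccard distance over single integers (q = 1), i.e. over whole words
seq_d <- seq_dist(ids_f, ids_g, method = "jaccard", q = 1)

# Cross-check in base R: 1 - |intersection| / |union| of the word sets
base_dist <- 1 - length(intersect(words_f, words_g)) /
                 length(union(words_f, words_g))

print(seq_d)
print(base_dist)
```

The two sentences share 'cat' and 'dog' out of four distinct words, so the word-level Jaccard distance is 1 - 2/4 = 0.5, and the similarity is 1 minus that. To compare many sentences at once, seq_dist() also accepts lists of integer vectors, provided the same vocabulary is used for every sentence.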