I am trying to calculate similarity of rows of one document term matrix with rows of another document term matrix.
```r
library(quanteda)

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)
B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = FALSE)

corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")
docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

# Combine both dfms, compute all pairwise similarities, then keep only A rows vs. B columns
dtm3 <- rbind(dfm(corp1, ngrams = 2), dfm(corp2, ngrams = 2))
d1 <- textstat_simil(dtm3, method = "cosine")
d1 <- as.matrix(d1)
# The dot is escaped so that it matches a literal "." rather than any character
d1 <- d1[grepl("^A\\.", rownames(d1)), grepl("^B\\.", colnames(d1))]
```
In the code above, I calculate similarity on the combined matrix and then remove the irrelevant cells. Is it possible to compare one document from A at a time in `textstat_simil(dtm3, method = "cosine")`? Below is the table I am looking for. Also, the size of the matrix doubled when I used `as.matrix(d1)`.
```
          B.1       B.2       B.3       B.4
A.1 0.3333333 0.0000000 0.4082483 1.0000000
A.2 0.4082483 0.0000000 0.0000000 0.0000000
A.3 0.4082483 0.7071068 0.0000000 0.4082483
A.4 0.0000000 0.5000000 0.0000000 0.0000000
```
This will work, although, as you point out, coercing the `dist` class object returned by `textstat_simil()` into a `matrix` doubles the size of the cosine similarity matrix.

Note that your use of `ngrams = 2` in the creation of `dtm3` will create a dfm from only bigram features (which are quite infrequent). If you want unigrams as well as bigrams, then this should be `ngrams = 1:2` instead.

That should work pretty well for most problems. If you are worried about the size of your object, you can either loop across individual selections of `dtm3`, building up the target object, or `lapply()` the comparisons as follows (but this is much less efficient):
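To make the idea concrete without depending on a particular quanteda version, here is a minimal base R sketch of the `lapply()` approach. The helpers `bigrams()`, `count_matrix()`, and `cosine()` are hypothetical stand-ins written for this illustration (a whitespace tokeniser and a plain count matrix in place of `dfm()`); they are not quanteda functions, and the tokenisation only approximates what quanteda does.

```r
# Base R only: an illustration of building the A-by-B result one B document at a time.
A <- c("X-ray right leg arteries", "x-ray left shoulder",
       "x-ray leg arteries", "x-ray leg with 20km distance")
B <- c("X-ray left leg arteries", "X-ray leg",
       "xray right leg", "X-ray right leg arteries")

# Hypothetical helper: lower-case, split on whitespace, join adjacent tokens.
bigrams <- function(txt) {
  tok <- strsplit(tolower(txt), "\\s+")[[1]]
  if (length(tok) < 2) return(character(0))
  paste(head(tok, -1), tail(tok, -1), sep = "_")
}

# Hypothetical helper: document-by-feature count matrix over a shared vocabulary.
count_matrix <- function(texts, vocab, prefix) {
  m <- t(vapply(texts, function(txt) {
    as.numeric(table(factor(bigrams(txt), levels = vocab)))
  }, numeric(length(vocab))))
  dimnames(m) <- list(paste(prefix, seq_along(texts), sep = "."), vocab)
  m
}

vocab <- unique(unlist(lapply(c(A, B), bigrams)))
mA <- count_matrix(A, vocab, "A")
mB <- count_matrix(B, vocab, "B")

# Cosine similarity between two count vectors.
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# One comparison per document in B: each call yields a single column
# holding the similarity of every A document to that B document.
cols <- lapply(rownames(mB), function(b) apply(mA, 1, cosine, y = mB[b, ]))
sim <- do.call(cbind, cols)
colnames(sim) <- rownames(mB)

print(round(sim, 7))
```

Because each step only ever holds one column, the full square similarity matrix over A and B combined is never created; the final object is just the A-by-B block. The trade-off, as noted above, is speed: looping in R is slower than one vectorised call.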