I am trying to calculate the similarity between the rows of one document-term matrix and the rows of another document-term matrix.
library(quanteda)

# Two sets of short documents to compare against each other
A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)
B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = FALSE)

corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")
# Name the documents A.1, A.2, ... and B.1, B.2, ... so they can be told apart later
docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

# Bigram dfm for each corpus, stacked into one combined matrix
dtm3 <- rbind(dfm(corp1, ngrams = 2), dfm(corp2, ngrams = 2))
d1 <- textstat_simil(dtm3, method = "cosine")
d1 <- as.matrix(d1)
# Keep only the A rows and B columns (the dot is escaped so it matches literally)
d1 <- d1[grepl("^A\\.", row.names(d1)), grepl("^B\\.", colnames(d1))]
In the code above I calculate the similarity on the combined matrix and then remove the irrelevant cells from the result. Is it possible to compare one document from A at a time in textstat_simil(dtm3, method = "cosine")? Below is the table I am looking for. Also, the file size of the matrix doubled when I used as.matrix(d1).
          B.1       B.2       B.3       B.4
A.1 0.3333333 0.0000000 0.4082483 1.0000000
A.2 0.4082483 0.0000000 0.0000000 0.0000000
A.3 0.4082483 0.7071068 0.0000000 0.4082483
A.4 0.0000000 0.5000000 0.0000000 0.0000000
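
For reference, the A-by-B block in the table above is just the cosine between each row of one count matrix and each row of the other, so it can be computed directly without building the full square matrix first. The following is a minimal base-R sketch of that idea, not quanteda's own implementation. The function name `cosine_rows` and the toy matrices `a` and `b` are hypothetical, and it assumes both matrices already share the same feature columns in the same order.

```r
# Cosine similarity between every row of `a` and every row of `b`.
# Assumption: both matrices have the same columns (features) in the same order.
cosine_rows <- function(a, b) {
  stopifnot(ncol(a) == ncol(b))
  dots  <- a %*% t(b)                                     # pairwise dot products
  norms <- outer(sqrt(rowSums(a^2)), sqrt(rowSums(b^2)))  # products of row norms
  dots / norms                                            # NaN where a row is all zeros
}

# Toy counts (hypothetical, not the bigram dfm from the question)
a <- matrix(c(1, 1, 0,
              0, 1, 1), nrow = 2, byrow = TRUE,
            dimnames = list(c("A.1", "A.2"), c("f1", "f2", "f3")))
b <- matrix(c(1, 1, 0,
              1, 0, 0,
              0, 0, 1), nrow = 3, byrow = TRUE,
            dimnames = list(c("B.1", "B.2", "B.3"), c("f1", "f2", "f3")))

sim <- cosine_rows(a, b)       # 2 x 3 matrix: rows A.*, columns B.*
sim["A.1", , drop = FALSE]     # a single document from A against all of B
```

Only the rectangular A-by-B result is ever materialised here, which is what avoids the extra memory of the full combined matrix. Newer quanteda releases appear to accept a second dfm through a `y` argument of textstat_simil, with the features of the two dfms aligned beforehand (for example with dfm_match); treat that as an assumption to check against the documentation of the installed version.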