Pairwise Distance between documents

I am trying to calculate similarity of rows of one document term matrix with rows of another document term matrix.

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = F)

corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")

docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

dtm3 <- rbind(dfm(corp1, ngrams=2), dfm(corp2, ngrams=2))
d1 = textstat_simil(dtm3, method = "cosine")
d1 = as.matrix(d1)

d1 = d1[grepl("^A.",row.names(d1)),grepl("^B.",colnames(d1))]

In the code I am calculating similarity on combined matrix and later removing irrelevant cells from the matrix. Is it possible to compare one document from A at a time in textstat_simil(dtm3, method = "cosine")? Below the table I am looking for. Also the file size of the matrix got doubled when I use as.matrix(d1).

         B.1       B.2       B.3       B.4
A.1 0.3333333 0.0000000 0.4082483 1.0000000
A.2 0.4082483 0.0000000 0.0000000 0.0000000
A.3 0.4082483 0.7071068 0.0000000 0.4082483
A.4 0.0000000 0.5000000 0.0000000 0.0000000

标签： r quanteda

1条回答

相关推荐>>

2楼-- · 2019-08-27 13:51

This will work, although as you point out, it doubles the cosine similarity matrix size in coercing the dist class return from textstat_simil() into a matrix.

d2 <- textstat_simil(dtm3, method = "cosine", diag = TRUE)
as.matrix(d2)[docnames(corp1), docnames(corp2)]
#           B.1       B.2       B.3       B.4
# A.1 0.3333333 0.0000000 0.4082483 1.0000000
# A.2 0.4082483 0.0000000 0.0000000 0.0000000
# A.3 0.4082483 0.7071068 0.0000000 0.4082483
# A.4 0.0000000 0.5000000 0.0000000 0.0000000

Note that your use of ngrams=2 in the creation of dtm3 will create a dfm from only bigram features (which are quire infrequent). If you want unigrams as well as bigrams, then this should be ngrams = 1:2 instead.

That should work pretty well for most problems. If you are worried about the size of your object, you can either loop across individual selections of the dtm3, building up the target object, or lapply() the comparisons as follows (but this is much less efficient):

cosines <- lapply(docnames(corp2), 
                  function(x) textstat_simil(dtm3[c(x, docnames(corp1)), ],
                                             method = "cosine",
                                             selection = x)[-1, , drop = FALSE])
do.call(cbind, cosines)
#           B.1       B.2       B.3       B.4
# A.1 0.3333333 0.0000000 0.4082483 1.0000000
# A.2 0.4082483 0.0000000 0.0000000 0.0000000
# A.3 0.4082483 0.7071068 0.0000000 0.4082483
# A.4 0.0000000 0.5000000 0.0000000 0.0000000

0人赞添加讨论(0) 举报

Pairwise Distance between documents

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间