Pairwise Distance between documents

2019-08-27 13:01发布

问题:

I am trying to calculate similarity of rows of one document term matrix with rows of another document term matrix.

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = F)

corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")

docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

dtm3 <- rbind(dfm(corp1, ngrams=2), dfm(corp2, ngrams=2))
d1 = textstat_simil(dtm3, method = "cosine")
d1 = as.matrix(d1)

d1 = d1[grepl("^A.",row.names(d1)),grepl("^B.",colnames(d1))]

In the code I am calculating similarity on combined matrix and later removing irrelevant cells from the matrix. Is it possible to compare one document from A at a time in textstat_simil(dtm3, method = "cosine")? Below the table I am looking for. Also the file size of the matrix got doubled when I use as.matrix(d1).

         B.1       B.2       B.3       B.4
A.1 0.3333333 0.0000000 0.4082483 1.0000000
A.2 0.4082483 0.0000000 0.0000000 0.0000000
A.3 0.4082483 0.7071068 0.0000000 0.4082483
A.4 0.0000000 0.5000000 0.0000000 0.0000000

回答1:

This will work, although as you point out, it doubles the cosine similarity matrix size in coercing the dist class return from textstat_simil() into a matrix.

d2 <- textstat_simil(dtm3, method = "cosine", diag = TRUE)
as.matrix(d2)[docnames(corp1), docnames(corp2)]
#           B.1       B.2       B.3       B.4
# A.1 0.3333333 0.0000000 0.4082483 1.0000000
# A.2 0.4082483 0.0000000 0.0000000 0.0000000
# A.3 0.4082483 0.7071068 0.0000000 0.4082483
# A.4 0.0000000 0.5000000 0.0000000 0.0000000

Note that your use of ngrams=2 in the creation of dtm3 will create a dfm from only bigram features (which are quire infrequent). If you want unigrams as well as bigrams, then this should be ngrams = 1:2 instead.

That should work pretty well for most problems. If you are worried about the size of your object, you can either loop across individual selections of the dtm3, building up the target object, or lapply() the comparisons as follows (but this is much less efficient):

cosines <- lapply(docnames(corp2), 
                  function(x) textstat_simil(dtm3[c(x, docnames(corp1)), ],
                                             method = "cosine",
                                             selection = x)[-1, , drop = FALSE])
do.call(cbind, cosines)
#           B.1       B.2       B.3       B.4
# A.1 0.3333333 0.0000000 0.4082483 1.0000000
# A.2 0.4082483 0.0000000 0.0000000 0.0000000
# A.3 0.4082483 0.7071068 0.0000000 0.4082483
# A.4 0.0000000 0.5000000 0.0000000 0.0000000


标签: r quanteda