I am using Quanteda's textstat_simil to compute semantic relatedness in a text corpus. The use of this function is explained here: https://rdrr.io/cran/quanteda/man/textstat_simil.html

Here is a running example, which works fine:
library(quanteda)

# compute term similarities
pres_dfm <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords("english"))
(s1 <- textstat_simil(pres_dfm, c("fair", "health", "terror"), method = "cosine", margin = "features"))
head(as.matrix(s1), 10)
as.list(s1, n = 8)
I have two questions.
First question: what weighting scheme is applied to the dfm's frequencies before the cosine similarity is computed? Normally, in distributional models like this one, similarity measures (e.g. cosine, Dice) are computed on weighted frequencies rather than on raw frequencies. Common weighting schemes include PPMI (Positive Pointwise Mutual Information), TF-IDF, etc. Which weighting scheme is applied here? And is it possible to use another scheme, if needed?
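To make the question concrete, this is the kind of weighting I have in mind; a rough PPMI sketch written only for illustration (ppmi_weight is my own helper, not a quanteda function):

library(quanteda)

# illustrative only: replace raw dfm counts with PPMI weights, computed densely
ppmi_weight <- function(x) {
  m    <- as.matrix(x)                   # documents x features, raw counts
  N    <- sum(m)
  p_df <- m / N                          # joint probabilities
  p_d  <- rowSums(m) / N                 # document marginals
  p_f  <- colSums(m) / N                 # feature marginals
  pmi  <- log2(p_df / outer(p_d, p_f))   # pointwise mutual information
  pmi[!is.finite(pmi) | pmi < 0] <- 0    # keep only positive, finite values
  as.dfm(pmi)                            # back to a dfm
}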
Second question: where can I find more details about how the textstat_simil methods have been implemented in Quanteda? The available options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamann", and "faith". In particular, I would like to know how "simple matching", "edice" and "ejaccard" are computed.
Thanks in advance for your answers.
Cheers, Marina
1) Unless you weight the dfm first using dfm_weight(), the dfm that is input to textstat_simil() will contain raw counts. (For cosine similarity, this produces the same result as relative term frequencies, since cosine is based on the angle between the vectors rather than on the distance between multi-dimensional coordinates.)
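For example, you can weight the dfm first and then compute the similarities (dfm_tfidf() and dfm_weight() are quanteda's weighting functions; the choice of tf-idf here is just illustrative):

library(quanteda)

# weight the dfm before computing similarities, e.g. with tf-idf
pres_dfm   <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords("english"))
pres_tfidf <- dfm_tfidf(pres_dfm)
# or relative frequencies: dfm_weight(pres_dfm, scheme = "prop")

s_weighted <- textstat_simil(pres_tfidf, c("fair", "health", "terror"),
                             method = "cosine", margin = "features")
head(as.matrix(s_weighted), 10)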
2) The source code for the methods can be viewed here, where the formulas are presented in simple form in the comments to the specific functions.
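For reference, the standard definitions of those three coefficients, written out for two numeric feature vectors x and y, look roughly like this (my own sketch of the usual formulas, not code taken from quanteda's source, so do check the linked functions):

# extended Jaccard on real-valued vectors
ejaccard <- function(x, y) sum(x * y) / (sum(x^2) + sum(y^2) - sum(x * y))

# extended Dice on real-valued vectors
edice <- function(x, y) 2 * sum(x * y) / (sum(x^2) + sum(y^2))

# simple matching coefficient: binarise, then take the share of positions that agree
smatch <- function(x, y) {
  xb <- x > 0
  yb <- y > 0
  sum(xb == yb) / length(x)
}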