I am using Quanteda's textstat_simil to compute semantic relatedness in a text corpus.
The use of this function is explained here: https://rdrr.io/cran/quanteda/man/textstat_simil.html
This is a running example and it works fine:
library(quanteda)

# compute term similarities
pres_dfm <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords("english"))
(s1 <- textstat_simil(pres_dfm, c("fair", "health", "terror"), method = "cosine", margin = "features"))
head(as.matrix(s1), 10)
as.list(s1, n=8)
I have two questions.
First question: what weighting scheme, if any, is applied to the dfm's frequencies before the cosine similarity is computed? Normally, in distributional models like this one, similarity measures (e.g. cosine, Dice) are computed on weighted frequencies rather than on raw frequencies. Common weighting schemes include PPMI (Positive Pointwise Mutual Information) and TF-IDF. Which weighting scheme has been applied here? Is it possible to use another scheme, if needed?
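To make the first question concrete, here is a toy sketch in base R (not Quanteda code) of what I mean by weighting before computing similarity. The matrix, its counts, and the helper names cosine() and ppmi() are all made up for illustration, and the PPMI variant shown is just one common formulation, not necessarily what any library does.

```r
# Toy document-feature matrix (rows = documents, columns = features).
# The counts are invented for illustration only.
m <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 3, 1,
              4, 0, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("fair", "health", "terror")))

# Cosine similarity between two numeric vectors (hypothetical helper)
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# One common PPMI formulation (hypothetical helper):
# max(0, log2( P(doc, feature) / (P(doc) * P(feature)) ))
ppmi <- function(m) {
  p_joint  <- m / sum(m)
  expected <- outer(rowSums(p_joint), colSums(p_joint))
  pmi <- log2(p_joint / expected)        # zero counts give -Inf
  pmi[!is.finite(pmi) | pmi < 0] <- 0    # clip -Inf and negative values to 0
  pmi
}

# Similarity between the same two features, on raw vs. weighted counts
raw_sim      <- cosine(m[, "fair"], m[, "terror"])
w            <- ppmi(m)
weighted_sim <- cosine(w[, "fair"], w[, "terror"])

print(c(raw = raw_sim, ppmi = weighted_sim))
```

The two values differ, which is why I would like to know what textstat_simil does internally. I assume that in Quanteda the equivalent would be to weight the dfm first and then pass the weighted dfm to textstat_simil, but I have not confirmed that this is the intended usage.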
Second question: where can I find more details about how the textstat_simil options have been implemented in Quanteda? Namely, the textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamann", and "faith".
In particular, I would like to know how simple matching, edice and ejaccard are computed.
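For reference, these are the textbook definitions as I understand them, sketched in base R. The function names and the example vectors are my own, and I do not know whether Quanteda computes the measures exactly this way; that is what I am trying to find out.

```r
# Simple matching on presence/absence: share of positions where both
# vectors agree (both present or both absent). Assumes counts are
# binarised first, which is my assumption, not a documented behaviour.
simple_matching <- function(x, y) {
  x <- x > 0
  y <- y > 0
  mean(x == y)
}

# Extended Jaccard for real-valued vectors: xy / (xx + yy - xy)
ejaccard <- function(x, y) {
  xy <- sum(x * y)
  xy / (sum(x^2) + sum(y^2) - xy)
}

# Extended Dice for real-valued vectors: 2xy / (xx + yy)
edice <- function(x, y) {
  2 * sum(x * y) / (sum(x^2) + sum(y^2))
}

# Made-up frequency vectors for two features across four documents
x <- c(2, 0, 1, 4)
y <- c(1, 0, 0, 2)

sm <- simple_matching(x, y)
ej <- ejaccard(x, y)
ed <- edice(x, y)
print(c(simple_matching = sm, ejaccard = ej, edice = ed))
```

If the Quanteda implementations differ from these (for instance in how counts are binarised for simple matching), a pointer to the relevant source or documentation would be very helpful.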
Thanks in advance for your answers.
Cheers, Marina