I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:
library(qdap)
polarity(DATA$state)$all$polarity
# Results:
[1] -0.8165 -0.4082 0.0000 -0.8944 0.0000 0.0000 0.0000 -0.5774 0.0000
[10] 0.4082 0.0000
Warning message:
In polarity(DATA$state) :
Some rows contain double punctuation. Suggested use of `sentSplit` function.
This warning can't be ignored: polarity seems to add up the scores of the individual sentences in each document, which can push document-level polarity scores outside the [-1, 1] bounds.
I'm aware of the option to first run sentSplit and then average across the sentences, perhaps weighting polarity by word count, but (1) this is inefficient (it takes roughly 4x as long as running on the full documents with the warning), and (2) it is unclear how the sentences should be weighted. This option would look something like this:
DATA$id <- seq(nrow(DATA)) # For identifying and aggregating documents
sentences <- sentSplit(DATA, "state")
library(data.table) # For aggregation
pol.dt <- data.table(polarity(sentences$state)$all)
pol.dt[, id := sentences$id]
document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), "id"]
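To make the weighting concrete, here is a minimal base-R sketch of the same word-count-weighted average on made-up numbers. The data frame `sent` and its values are hypothetical and are not output from polarity; it only illustrates what the aggregation step computes.

```r
# Hypothetical per-sentence scores (made-up values, NOT real polarity output).
sent <- data.frame(
  id       = c(1, 1, 2),          # document each sentence belongs to
  polarity = c(0.5, -0.5, 0.25),  # per-sentence polarity
  wc       = c(4, 12, 8)          # per-sentence word count
)

# Word-count-weighted mean polarity per document.
doc_pol <- sapply(
  split(sent, sent$id),
  function(d) weighted.mean(d$polarity, d$wc)
)
doc_pol
```

In this toy case the longer negative sentence dominates document 1, which is exactly the weighting choice I'm unsure about.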
I was hoping I could run polarity on a version of the vector with periods removed, but it seems that sentSplit does more than that. Removing periods works on DATA but not on other sets of text (I'm unsure of the full set of breaks other than periods).
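Something along these lines is what I have in mind. This is only a sketch: `strip_endmarks` is a hypothetical helper name, and the set of end marks (`.`, `?`, `!`, `|`) is my assumption about what counts as a sentence break, not something I have confirmed covers every case.

```r
# Hypothetical helper: make each document look like one long sentence by
# replacing assumed sentence end marks (. ? ! |) with a space.
# Caveat: this also breaks decimals ("3.5") and abbreviations ("e.g.").
strip_endmarks <- function(x) {
  x <- gsub("[.?!|]+", " ", x)  # drop runs of assumed end marks
  x <- gsub("\\s+", " ", x)     # collapse repeated whitespace
  trimws(x)                     # remove leading/trailing spaces
}

docs <- c("I like it. It is great!", "Really? No way.")
one_sentence <- strip_endmarks(docs)
one_sentence
```

The result could then be passed to polarity in place of the original vector, though I don't know whether the scores would still be meaningful without any end marks.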
So, I suspect the best way of approaching this is to make each element of the document vector look like one long sentence. How would I do this, or is there another way?