Question:

I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:

library(qdap)
polarity(DATA$state)$all$polarity
# Results:
 [1] -0.8165 -0.4082  0.0000 -0.8944  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000
Warning message:
In polarity(DATA$state) :
  Some rows contain double punctuation.  Suggested use of `sentSplit` function.

This warning can't be ignored, because polarity seems to add up the polarity scores of each sentence in the document. This can result in document-level polarity scores outside the [-1, 1] bounds.

I'm aware of the option to first run sentSplit and then average across the sentences, perhaps weighting polarity by word count, but (1) this is inefficient (it takes roughly 4x as long as running on the full documents with the warning), and (2) it is unclear how the sentences should be weighted. This option would look something like this:

DATA$id <- seq(nrow(DATA)) # For identifying and aggregating documents 
sentences <- sentSplit(DATA, "state")
library(data.table) # For aggregation
pol.dt <- data.table(polarity(sentences$state)$all)
pol.dt[, id := sentences$id]
document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), "id"]

I was hoping I could run polarity on a version of the vector with periods removed, but it seems that sentSplit does more than that. Removing periods works on DATA but not on other sets of text (I'm unsure of the full set of sentence breaks other than periods).

So, I suspect the best way of approaching this is to make each element of the document vector look like one long sentence. How would I do this, or is there another way?

Answer 1:

Max found a bug in this version of qdap (1.3.4) that counted a placeholder as a word. This affects the equation because the denominator is sqrt(n), where n is the word count. The bug was corrected in 1.3.5, which is why the two outputs did not match.
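
To see why one extra counted word shifts the score, here is a minimal base R sketch. It assumes the score is simply the summed weight of the polarized words divided by sqrt(n); the helper name polarity_score and the word counts are hypothetical and chosen only for illustration, not taken from qdap's internals.

```r
# Hypothetical helper: summed polarized-word weight divided by sqrt(word count)
polarity_score <- function(weight_sum, n) weight_sum / sqrt(n)

# Assumed example: summed weight of -1 over 5 real words
correct <- polarity_score(-1, 5)
# Same text with a placeholder wrongly counted as a sixth word
inflated <- polarity_score(-1, 6)

round(c(correct = correct, inflated = inflated), 4)
# correct inflated
# -0.4472  -0.4082
```

The gap between these two values is the same as the gap between -0.4472 and -0.4082 for the second document in the outputs shown in this thread, which is consistent with one extra word in the denominator.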

Here is the output:

library(qdap)
counts(polarity(DATA$state))[, "polarity"]

## > counts(polarity(DATA$state))[, "polarity"]
##  [1] -0.8164966 -0.4472136  0.0000000 -1.0000000  0.0000000  0.0000000  0.0000000
##  [8] -0.5773503  0.0000000  0.4082483  0.0000000
## Warning message:
## In polarity(DATA$state) : 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.

In this case using strip does not matter. It may in more complex situations involving amplifiers, negators, negatives, and commas. Here is an example:

## > counts(polarity("Really, I hate it"))[, "polarity"]
## [1] -0.5
## > counts(polarity(strip("Really, I hate it")))[, "polarity"]
## [1] -0.9

See the documentation for more.

Answer 2:

Looks like removing punctuation and other extras tricks polarity into thinking the vector is a single sentence:

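library(qdap)
library(tm) # removePunctuation, removeNumbers and stripWhitespace come from tm
SimplifyText <- function(x) {
  # Lowercase, collapse whitespace, then drop digits and punctuation
  removePunctuation(removeNumbers(stripWhitespace(tolower(x))))
}
polarity(SimplifyText(DATA$state))$all$polarity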
# Result (no warning)
 [1] -0.8165 -0.4472  0.0000 -1.0000  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000
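
If you would rather not load tm just for this, the same kind of cleanup can be sketched in base R. This is only an approximation of the function above, not a drop-in replacement: the name simplify_text_base is hypothetical, and the regular expressions are my assumption about what counts as punctuation and digits.

```r
# Hypothetical base R approximation of SimplifyText
simplify_text_base <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)  # drop punctuation, including sentence end marks
  x <- gsub("[[:digit:]]+", "", x)  # drop numbers
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)
}

simplify_text_base("Computer is fun. Not too fun.")
# [1] "computer is fun not too fun"
```

The cleaned vector can then be passed to polarity in the same way as above. Keep in mind that dropping commas can change the score when amplifiers or negators are involved, as the other answer shows.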