Estimating document polarity using R's qdap pa

I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:

library(qdap)
polarity(DATA$state)$all$polarity
# Results:
 [1] -0.8165 -0.4082  0.0000 -0.8944  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000
Warning message:
In polarity(DATA$state) :
  Some rows contain double punctuation.  Suggested use of `sentSplit` function.

This warning can't be ignored: when a document contains several sentences, polarity seems to add up the polarity scores of the individual sentences. This can result in document-level polarity scores outside the [-1, 1] bounds.

I'm aware of the option to first run sentSplit and then average across the sentences, perhaps weighting polarity by word count, but (1) this is inefficient (it takes roughly 4x as long as running on the full documents with the warning), and (2) it is unclear how the sentences should be weighted. This option would look something like this:

DATA$id <- seq(nrow(DATA)) # For identifying and aggregating documents 
sentences <- sentSplit(DATA, "state")
library(data.table) # For aggregation
pol.dt <- data.table(polarity(sentences$state)$all)
pol.dt[, id := sentences$id]
document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), "id"]

I was hoping I could run polarity on a version of the vector with periods removed, but it seems that sentSplit does more than split on periods. Removing periods works on DATA but not on other sets of text (I'm unsure of the full set of sentence breaks other than periods).

So, I suspect the best way of approaching this is to make each element of the document vector look like one long sentence. How would I do this, or is there another way?

Tags: r nlp sentiment-analysis qdap

2 answers

一夜七次

#2 · 2020-07-13 10:43

Looks like removing punctuation and other extras tricks polarity into thinking the vector is a single sentence:

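library(qdap)
library(tm) # removePunctuation, removeNumbers and stripWhitespace are tm functions
# Lowercase, collapse whitespace, then drop digits and punctuation
SimplifyText <- function(x) {
  removePunctuation(removeNumbers(stripWhitespace(tolower(x))))
}
polarity(SimplifyText(DATA$state))$all$polarity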
# Result (no warning)
 [1] -0.8165 -0.4472  0.0000 -1.0000  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000


老娘就宠你

#3 · 2020-07-13 10:47

Max found a bug in this version of qdap (1.3.4): it counted a placeholder as a word, which affects the equation because the denominator is sqrt(n), where n is the word count. This has been corrected as of 1.3.5, which is why the two outputs did not match.

Here is the output:

library(qdap)
counts(polarity(DATA$state))[, "polarity"]

## > counts(polarity(DATA$state))[, "polarity"]
##  [1] -0.8164966 -0.4472136  0.0000000 -1.0000000  0.0000000  0.0000000  0.0000000
##  [8] -0.5773503  0.0000000  0.4082483  0.0000000
## Warning message:
## In polarity(DATA$state) : 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.
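
The sqrt(n) denominator can be checked by hand against the scores that differ between the question's output and the corrected output. This is only a base R sketch: the raw sums (-1 and -2) and the word counts are assumptions inferred from the printed numbers, not values read out of qdap, and `score` is a hypothetical helper.

```r
# Hand check of the sqrt(n) denominator in base R (qdap is not needed).
# Assumption: raw sums and word counts below are inferred from the
# printed scores; `score` is a hypothetical helper, not a qdap function.
score <- function(raw, n) raw / sqrt(n)

# Element 2: five words, or six if a placeholder is counted as a word
round(score(-1, 6), 4)  # -0.4082, as in the question's output
round(score(-1, 5), 4)  # -0.4472, as in the corrected output

# Element 4: four words, or five if a placeholder is counted as a word
round(score(-2, 5), 4)  # -0.8944, as in the question's output
round(score(-2, 4), 4)  # -1, as in the corrected output
```

Under these assumptions, one extra counted word is enough to explain both mismatches.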

In this case using strip does not matter. It may in more complex situations involving amplifiers, negators, negatives, and commas. Here is an example:

## > counts(polarity("Really, I hate it"))[, "polarity"]
## [1] -0.5
## > counts(polarity(strip("Really, I hate it")))[, "polarity"]
## [1] -0.9

See the documentation for more.
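
On the question of how to weight sentences after sentSplit: if each sentence score is a raw sum divided by sqrt(word count), as described above, the raw sums can be recovered and pooled before dividing by the square root of the total word count. The following is a rough sketch under that assumption only. `combine_polarity` is a hypothetical helper, the numbers are made up, and the result need not match what polarity returns on the unsplit text, since amplifiers and negators may be treated differently across sentence boundaries.

```r
# Sketch: pool per-sentence scores into one document-level score.
# Assumption: score = raw_sum / sqrt(wc), so raw_sum = score * sqrt(wc).
# `combine_polarity` is a hypothetical helper, not part of qdap.
combine_polarity <- function(polarity, wc) {
  sum(polarity * sqrt(wc)) / sqrt(sum(wc))
}

# Made-up document with two sentences
pol <- c(1 / sqrt(4), -2 / sqrt(5))  # per-sentence scores
wc  <- c(4, 5)                       # per-sentence word counts

combine_polarity(pol, wc)  # (1 - 2) / sqrt(9)
sum(pol * wc) / sum(wc)    # word-count weighted mean, for comparison
```

With the data.table from the question, the same expression could be used per document: pol.dt[, sum(polarity * sqrt(wc)) / sqrt(sum(wc)), by = "id"].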
