Can the ANEW dictionary be used for sentiment anal

Posted 2019-04-02 05:31

Question:

I am trying to find a way to implement the Affective Norms for English Words (in Dutch) for a longitudinal sentiment analysis with Quanteda. What I ultimately want to have is a "mean sentiment" per year in order to show any longitudinal trends.

In the dataset, all words are scored on a 7-point Likert scale by 64 coders on four categories, which provides a mean for each word. What I want to do is take one of the dimensions and use this to analyse changes in emotions over time. I realise that Quanteda has a function for implementing the LIWC dictionary, but I would prefer using the open-source ANEW data if possible.

Essentially, I need help with the implementation because I am new to coding and R.

The ANEW file looks something like this (in .csv):

WORD/SCORE: cancer: 1.01, potato: 3.56, love: 6.56

Answer 1:

Not yet, directly, but... ANEW differs from other dictionaries since it does not use a key: value pair format, but rather assigns a numerical score to each term. This means you are not counting matches of values against a key, but rather selecting features and then scoring them using weighted counts.

This could be done in quanteda by:

  1. Get ANEW features into a character vector.

  2. Use dfm(yourtext, select = ANEWfeatures) to create a dfm with just the ANEW features.

  3. Multiply each count by the valence of the corresponding ANEW term, recycled column-wise so that each feature count gets multiplied by its ANEW value.

  4. Use rowSums() on the weighted matrix to get document-level valence scores.

or alternatively,

  1. File an issue and we will add this functionality to quanteda.
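
To make the four manual steps concrete, here is a minimal base R sketch of the weighting and summing in steps 3 and 4. It assumes a toy count matrix in place of a real dfm, and it reuses the three example scores from the question as the ANEW values; the documents and counts are made up for illustration.

```r
# Example valence scores taken from the question (stand-in for the full ANEW list)
anew <- c(cancer = 1.01, potato = 3.56, love = 6.56)

# Toy document-feature count matrix: rows are documents, columns are ANEW terms
counts <- matrix(c(1, 0, 2,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), names(anew)))

# Step 3: multiply every column of counts by that term's valence
weighted <- sweep(counts, 2, anew[colnames(counts)], `*`)

# Step 4: sum across features to get one valence score per document
doc_scores <- rowSums(weighted)
print(doc_scores)
```

Indexing `anew` by `colnames(counts)` keeps the weights aligned with the columns even if the two are stored in a different order.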

Note also that tidytext scores sentiment with a similar numerically weighted lexicon (AFINN), if you want to convert your dfm into their object and use that approach (which is basically a version of what I've suggested above).

Updated:

It turns out I already built the feature into quanteda that you need, and had simply not realised it!

This will work. First, load in the ANEW dictionary. (You have to supply the ANEW file yourself.)

# read in the ANEW data
df_anew <- read.delim("ANEW2010All.txt", stringsAsFactors = FALSE)
# construct a vector of weights with the term as the name
vector_anew <- df_anew$ValMn
names(vector_anew) <- df_anew$Word

Now that we have a named vector of weights, we can apply that using dfm_weight(). Below, I've first normalised the dfm by relative frequency, so that the document aggregate score is not dependent on the document length in tokens. If you don't want that, just remove the line indicated below.

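library("quanteda")
# tokenize first, then keep only the features that appear in ANEW
# (dfm(corpus, select = ...) is deprecated in quanteda 3 and later)
dfm_anew <- dfm_select(dfm(tokens(data_corpus_inaugural)), pattern = df_anew$Word)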

# weight by the ANEW weights
dfm_anew_weighted <- dfm_anew %>%
    dfm_weight(scheme = "prop") %>%   # remove if you don't want normalized scores
    dfm_weight(weights = vector_anew)
## Warning message:
## dfm_weight(): ignoring 1,427 unmatched weight features 

tail(dfm_anew_weighted)[, c("life", "day", "time")]
## Document-feature matrix of: 6 documents, 3 features (5.56% sparse).
## 6 x 3 sparse Matrix of class "dfm"
##               features
## docs                 life        day       time
##   1997-Clinton 0.07393220 0.06772881 0.21600000
##   2001-Bush    0.10004587 0.06110092 0.09743119
##   2005-Bush    0.09380645 0.12890323 0.11990323
##   2009-Obama   0.06669725 0.10183486 0.09743119
##   2013-Obama   0.08047970 0          0.19594096
##   2017-Trump   0.06826291 0.12507042 0.04985915

# total scores
tail(rowSums(dfm_anew_weighted))
## 1997-Clinton    2001-Bush    2005-Bush   2009-Obama   2013-Obama   2017-Trump 
##     5.942169     6.071918     6.300318     5.827410     6.050216     6.223944
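
To get from these document scores to the "mean sentiment per year" asked about in the question, one option is to average the document-level scores within each year. The sketch below uses base R only and assumes a hypothetical data frame `scores` with made-up values; with quanteda, the valence column would come from `rowSums()` on the weighted dfm and the year from a document variable (for the inaugural corpus, `docvars(data_corpus_inaugural, "Year")`).

```r
# Hypothetical document-level valence scores with the year of each document.
# The numbers are invented for illustration, not taken from any real corpus.
scores <- data.frame(
  year    = c(2015, 2015, 2016, 2016, 2016),
  valence = c(5.0, 6.0, 4.0, 5.0, 6.0)
)

# Mean valence per year: one row per year, ready to plot as a trend line
yearly <- aggregate(valence ~ year, data = scores, FUN = mean)
print(yearly)
```

If the documents were normalised with `scheme = "prop"` first, each document counts equally in the yearly mean regardless of its length, which may or may not be what you want for your corpus.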