I am trying to find a way to implement the Affective Norms for English Words (in Dutch) for a longitudinal sentiment analysis with quanteda. What I ultimately want is a "mean sentiment" per year, in order to show any longitudinal trends.
In the data set, all words are scored on a 7-point Likert scale by 64 coders on four categories, which provides a mean for each word. What I want to do is take one of the dimensions and use it to analyse changes in emotion over time. I realise that quanteda has a function for implementing the LIWC dictionary, but I would prefer using the open-source ANEW data if possible.
Essentially, I need help with the implementation, because I am new to coding and to R.
The ANEW file looks something like this (in .csv):
```
word,score
cancer,1.01
potato,3.56
love,6.56
```
Not directly, yet, but... ANEW differs from other dictionaries in that it does not use a key: value pair format, but rather assigns a numerical score to each term. This means you are not counting matches of values against keys, but rather selecting features and then scoring them using weighted counts.
This could be done in quanteda by:

1. Get the ANEW features into a character vector.
2. Use `dfm(yourtext, select = ANEWfeatures)` to create a dfm with just the ANEW features.
3. Multiply each counted value by the valence of its ANEW feature, recycled column-wise, so that each feature count is weighted by its ANEW score.
4. Use `rowSums()` on the weighted matrix to get document-level valence scores.

Alternatively, note that tidytext uses ANEW for its sentiment scoring, if you want to convert your dfm into their object and use that approach (which is basically a version of what I've suggested above).
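The steps above can be sketched in base R with a toy document-feature matrix. The documents, counts, and feature set here are invented for illustration; only the three ANEW scores come from the sample file above:

```r
# Toy document-feature matrix: rows = documents, columns = features (step 2's output)
counts <- matrix(c(2, 0, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("cancer", "potato", "love")))

# ANEW-style valence scores as a named numeric vector (step 1)
anew <- c(cancer = 1.01, potato = 3.56, love = 6.56)

# Keep only the features that appear in ANEW (step 2)
counts <- counts[, colnames(counts) %in% names(anew), drop = FALSE]

# Multiply each count by the valence of its feature (step 3);
# t() is used because R recycles a vector down columns, not across rows
weighted <- t(t(counts) * anew[colnames(counts)])

# Document-level valence scores (step 4)
rowSums(weighted)
```

This is the same arithmetic a dfm-based pipeline performs, just made explicit on a plain matrix.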
Updated:
It turns out I already built the feature into quanteda that you need, and had simply not realised it!
This will work. First, load the ANEW dictionary in as a named numeric vector of valence scores. (You have to supply the ANEW file yourself.)
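A minimal sketch of that loading step, assuming the ANEW file is a two-column CSV of words and mean scores as shown earlier. The file name and column names are assumptions, so adjust them to your file; a small inline data frame stands in for the real file here:

```r
# In practice you would read your own file, e.g.:
#   anew_df <- read.csv("ANEW.csv", stringsAsFactors = FALSE)
# Here, a small inline sample stands in for it:
anew_df <- data.frame(word  = c("cancer", "potato", "love"),
                      score = c(1.01, 3.56, 6.56),
                      stringsAsFactors = FALSE)

# Convert to a named numeric vector of weights
anew <- setNames(anew_df$score, anew_df$word)

anew["love"]  # look up the valence score for a single word
```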
Now that we have a named vector of weights, we can apply it using `dfm_weight()`. Below, I've first normalised the dfm by relative frequency, so that the aggregate document score does not depend on document length in tokens. If you don't want that, just remove the line indicated below.
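To make the normalise-then-weight computation concrete, here is what it amounts to on a plain matrix. In quanteda itself the equivalent calls would be `dfm_weight(mydfm, scheme = "prop")` followed by `dfm_weight(..., weights = anew)` (argument names as in recent quanteda versions; check the documentation of your installed version). The toy counts below are invented for illustration:

```r
# Toy counts standing in for a dfm restricted to ANEW features
counts <- matrix(c(2, 0, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("cancer", "potato", "love")))
anew <- c(cancer = 1.01, potato = 3.56, love = 6.56)

# Normalise to relative frequencies so scores do not depend on document
# length in tokens (remove this line to work with raw counts instead)
props <- counts / rowSums(counts)

# Apply the ANEW weights column-wise, then sum to one valence score per document
valence <- rowSums(t(t(props) * anew[colnames(props)]))
valence
```

Because each row of `props` sums to 1, the resulting score is a weighted mean valence per document, directly comparable across documents of different lengths.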