I am trying to extract the identified dictionary words from a Quanteda dfm, but have been unable to find a solution.
Does someone have a solution for this?
Sample input:
dict <- dictionary(list(season = c("spring", "summer", "fall", "winter")))
dfm <- dfm("summer is great", dictionary = dict)
Output:
> dfm
Document-feature matrix of: 1 document, 1 feature.
1 x 1 sparse Matrix of class "dfmSparse"
features
docs season
text1 1
I now know that a seasonality dict word has been identified in the sentence, but I would also like to know which word it was.
This should preferably be extracted in the table format:
docs dict dictWord
text1 season summer
You can create a second dfm using the select argument, and then cbind() it to the first, dictionary-based dfm.
dict <- dictionary(list(season = c("spring", "summer", "fall", "winter")))
txt <- "summer is great"
season_dfm <- dfm(txt, dictionary = dict, verbose = FALSE)
dict_dfm <- dfm(txt, select = dict, verbose = FALSE)
combined_dfm <- cbind(season_dfm, dict_dfm)
combined_dfm
## Document-feature matrix of: 1 document, 2 features.
## 1 x 2 sparse Matrix of class "dfmSparse"
## season summer
## text1 1 1
If you want the output as a table, it would be:
dict_df <- as.data.frame(combined_dfm)
names(dict_df)[2] <- "dictWord"
dict_df
## season dictWord
## text1 1 1
but that only works if you have a single dictionary value per text; otherwise dict_dfm will have multiple feature columns.
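For the case with several matches per text, one option is to match the tokens against the dictionary directly and build the docs / dict / dictWord table from the question row by row. What follows is only a minimal base R sketch, not quanteda's own matching: it assumes a plain named list (`dict_list`) in place of the quanteda dictionary object, exact lower-case matching with no glob patterns, and naive splitting on non-letters. `match_dict_words` is a hypothetical helper name, and the second text is made up for illustration.

```r
# Hypothetical helper: one row per (document, dictionary key, matched word).
# Assumes `texts` is a named character vector and `dict_list` is a plain
# named list of lower-case words. This does NOT use quanteda's tokenizer.
match_dict_words <- function(texts, dict_list) {
  rows <- list()
  for (doc in names(texts)) {
    # Naive tokenization: lower-case, then split on runs of non-letters
    toks <- strsplit(tolower(texts[[doc]]), "[^a-z]+")[[1]]
    for (key in names(dict_list)) {
      # Keep every token that appears in this dictionary entry
      hits <- toks[toks %in% dict_list[[key]]]
      if (length(hits) > 0) {
        rows[[length(rows) + 1]] <- data.frame(
          docs = doc, dict = key, dictWord = hits,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  # Returns NULL when nothing matched in any document
  do.call(rbind, rows)
}

dict_list <- list(season = c("spring", "summer", "fall", "winter"))
txt <- c(text1 = "summer is great", text2 = "winter follows fall")
match_dict_words(txt, dict_list)
```

Because each match gets its own row, a text containing two dictionary words simply yields two rows instead of extra feature columns.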