I was wondering if it's possible to split up ngram features in a document-feature matrix (dfm) in such a way that, e.g., a bigram results in two separate unigrams?
head(dfm, n = 3, nfeature = 4)
docs      in_the  great  plenary  emission_reduction
10752099       3      1        1                   3
10165509       8      0        0                   3
10479890       4      0        0                   1
So, the above dfm would result in something like this:
head(dfm, n = 3, nfeature = 4)
docs      in  great  plenary  emission  the  reduction
10752099   3      1        1         3    3          3
10165509   8      0        0         3    8          3
10479890   4      0        0         1    4          1
For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").
Thank you in advance!
EDIT: The following can be used as reproducible example.
library(quanteda)
eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
# dfm() on a corpus is deprecated in quanteda >= 3, so tokenize first
eg.dfm <- dfm(tokens(eg.corp))
head(eg.dfm)
I don't know if this is the best approach (it might use a lot of RAM, since it turns the sparse dfm into a data.frame/matrix), but it should work.
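A minimal sketch of the dense-matrix idea, in base R only, so it is not necessarily the code originally posted here. It assumes the dfm has already been converted to an ordinary count matrix (with quanteda, `as.matrix(eg.dfm)` should give one). The helper name `split_ngram_features` and the hand-built matrix `m` are made up for illustration; `m` simply mirrors the counts of the reproducible example above.

```r
# Hypothetical helper (not part of quanteda): split compound feature names on
# `sep` and add each compound's counts to every unigram it contains.
split_ngram_features <- function(m, sep = "_") {
  parts <- strsplit(colnames(m), sep, fixed = TRUE)
  # Index of the source column, repeated once per unigram in that feature
  src <- rep(seq_along(parts), lengths(parts))
  unigrams <- unlist(parts)
  # Duplicate each compound column once per unigram ...
  expanded <- m[, src, drop = FALSE]
  # ... then sum all columns that now share the same unigram name
  t(rowsum(t(expanded), group = unigrams, reorder = FALSE))
}

# Hand-built stand-in for as.matrix(eg.dfm) from the example above
m <- matrix(
  c(1, 1, 1, 1, 0, 0,
    0, 0, 1, 1, 1, 0,
    1, 1, 0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("text1", "text2", "text3"),
    c("increase", "in_the", "great", "plenary",
      "emission_reduction", "emission_increase")
  )
)

res <- split_ngram_features(m)
print(res)
```

Unigrams that occur in several features are summed, so for `text3` both `increase` and `emission` end up with a count of 2. If a dfm is needed again afterwards, `as.dfm(res)` from quanteda should convert the matrix back.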