Split up ngrams in document-feature matrix (quante

I was wondering whether it's possible to split up ngram features in a document-feature matrix (dfm) so that, for example, a bigram results in two separate unigrams.

head(dfm, n = 3, nfeature = 4)

docs       in_the great plenary emission_reduction
  10752099      3     1       1                  3
  10165509      8     0       0                  3
  10479890      4     0       0                  1

So, the above dfm would result in something like this:

head(dfm, n = 3, nfeature = 4)

docs       in great plenary emission the reduction
  10752099  3     1       1        3   3         3
  10165509  8     0       0        3   8         3
  10479890  4     0       0        1   4         1

For context: the ngrams in the dfm come from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").

Thank you in advance!

EDIT: The following can be used as a reproducible example.

library(quanteda)

eg.txt <- c('increase in_the great plenary', 
            'great plenary emission_reduction', 
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)

head(eg.dfm)

Tags: r quanteda

1 Answer

戒情不戒烟

#2 · 2019-08-18 19:29

I don't know if this is the best approach (it might use a lot of RAM, since it turns the sparse dfm into a dense data.frame/matrix), but it should work:

# convert the dfm to a dense matrix (documents in rows, features in columns);
# as.matrix() keeps only the counts, with the document names as row names
DF <- as.matrix(eg.dfm)
# transpose it so that the features are in the rows
MX <- t(DF)
# split the current column (feature) names by '_'
colsSplit <- strsplit(colnames(DF), '_')
# replicate each row once per part of its name and give the rows the new split names
MX <- MX[rep(seq_along(colsSplit), lengths(colsSplit)), , drop = FALSE]
rownames(MX) <- unlist(colsSplit)
# aggregate the matrix rows having the same name and transpose again
MX2 <- t(do.call(rbind, by(MX, rownames(MX), colSums)))
# turn the matrix back into a dfm
eg.dfm.res <- as.dfm(MX2)

Result:

> eg.dfm.res
Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
3 x 7 sparse Matrix of class "dfmSparse"
       features
docs    emission great in increase plenary reduction the
  text1        0     1  1        1       1         0   1
  text2        1     1  0        0       1         1   0
  text3        2     0  1        2       0         1   1
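
If RAM is a concern, the same aggregation can probably be done without leaving sparse format. The following is only a sketch, not a tested recipe for every quanteda version: it builds a sparse mapping from the old features to the split features and multiplies the dfm by it. The names `parts`, `new.feats` and `map` are made up for this illustration, and it assumes that `tokens()` keeps `in_the` as a single token, as in the example from the question.

```r
library(quanteda)
library(Matrix)

eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.dfm <- dfm(tokens(eg.txt))

# split every feature name at '_' (unigrams yield a single part)
parts <- strsplit(featnames(eg.dfm), '_', fixed = TRUE)
new.feats <- sort(unique(unlist(parts)))

# sparse (old features x new features) mapping; duplicated (i, j) pairs are
# summed by sparseMatrix(), so a feature such as 'a_a' would count twice for 'a'
map <- sparseMatrix(
  i = rep(seq_along(parts), lengths(parts)),
  j = match(unlist(parts), new.feats),
  x = 1,
  dims = c(length(parts), length(new.feats)),
  dimnames = list(featnames(eg.dfm), new.feats)
)

# (docs x old features) %*% (old features x new features), all sparse
eg.dfm.sparse <- as.dfm(as(eg.dfm, 'dgCMatrix') %*% map)
eg.dfm.sparse
```

Each count of an ngram is added to the column of every one of its parts, which is the same aggregation as above, but the dense intermediate matrix is never created.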
