Split up ngrams in (sparse) document-feature matrix

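This is a follow-up to an earlier question. There, I asked whether the ngram features in a document-feature matrix (the dfm class from the quanteda package) can be split so that, for example, a bigram yields two separate unigrams.

For better understanding: the ngrams in my dfm come from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").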

library(quanteda)

eg.txt <- c('increase in_the great plenary', 
            'great plenary emission_reduction', 
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)

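There was a good answer for this example, and it works fine for relatively small matrices like the one above. However, as soon as the matrix gets bigger, I keep running into the following memory error.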

> #turn the dfm into a matrix
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) : 
  Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105

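So, is there a more memory-efficient way to solve this ngram problem, or to deal with large (sparse) matrices/data frames? Thank you in advance!

The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since the typical document-feature matrix is 90% sparse, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.

Note that this is a better solution than the one proposed in the linked question, and it should also work efficiently for your much larger object.

Here is a function that does the splitting. It lets you set the concatenator character(s), and it works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.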
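To make the memory argument concrete, here is a minimal sketch that compares the footprint of a sparse matrix with its dense equivalent. It uses the Matrix package rather than a real dfm, and the dimensions and the roughly 10% fill rate are made-up values chosen only for illustration.

```r
library(Matrix)

# Hypothetical dimensions, chosen only for illustration
n_docs  <- 1000
n_feats <- 2000

# Roughly 10% of the cells are non-zero (i.e. about 90% sparse);
# duplicated (i, j) pairs are summed, so the fill is slightly lower
set.seed(1)
nnz <- n_docs * n_feats / 10
m_sparse <- sparseMatrix(
    i = sample(n_docs, nnz, replace = TRUE),
    j = sample(n_feats, nnz, replace = TRUE),
    x = rep(1, nnz),
    dims = c(n_docs, n_feats)
)

# Dense conversion stores every cell, zeros included
m_dense <- as.matrix(m_sparse)

print(format(object.size(m_sparse), units = "MB"))
print(format(object.size(m_dense), units = "MB"))
```

The dense copy grows with the product of documents and features, regardless of how many cells are actually non-zero, which is why the conversion fails on large inputs.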

# function to split and duplicate counts in features containing 
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
    # separate the unigrams
    x_unigrams <-  dfm_remove(x, concatenator, valuetype = "regex")

    # separate the ngrams
    x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
    # split into components
    split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
    # get a repeated index for the ngram feature names
    index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
    # subset the ngram matrix using the (repeated) ngram feature names
    x_split_ngrams <- x_ngrams[, index_split_ngrams]
    # assign the ngram dfm the feature names of the split ngrams
    colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)

    # return the column concatenation of unigrams and split ngrams
    suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}

So:

dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1

Here, splitting the ngrams produces new "unigrams" that share a feature name with existing ones. You can (re)combine them efficiently with dfm_compress():

dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1
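For intuition about what combining identically named columns involves, the following sketch does the same kind of operation on a plain sparse matrix from the Matrix package. It is not quanteda's implementation of dfm_compress(), only an illustration of the idea; the toy matrix and the names `feats` and `map` are made up for this example.

```r
library(Matrix)

# Toy counts with a duplicated column name ("increase")
m <- sparseMatrix(
    i = c(1, 3, 2, 3, 3),
    j = c(1, 1, 2, 2, 3),
    x = rep(1, 5),
    dims = c(3, 3),
    dimnames = list(paste0("text", 1:3),
                    c("increase", "emission", "increase"))
)

# Unique feature names, in order of first appearance
feats <- unique(colnames(m))

# Sparse indicator matrix: row k has a single 1 in the column of the
# unique feature that original column k belongs to
map <- sparseMatrix(
    i = seq_len(ncol(m)),
    j = match(colnames(m), feats),
    x = rep(1, ncol(m)),
    dims = c(ncol(m), length(feats)),
    dimnames = list(NULL, feats)
)

# Multiplying sums the columns that share a name; the result stays sparse
m_combined <- m %*% map
print(m_combined)
```

Because both operands are sparse, no dense intermediate is created at any point.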