Remove empty documents from DocumentTermMatrix in

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en")) 
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list(minDocFreq=2, minWordLength=2))

And then performing LDA:

LDA(dtm, 30)

This final call to LDA() returns the error

  "Each row of the input matrix needs to contain at least one non-zero entry".

I assume this means that there is at least one document that has no terms in it after preprocessing. Is there an easy way to remove documents that contain no terms from a DocumentTermMatrix?

I looked in the documentation for the topicmodels package and found the function removeSparseTerms, which removes terms that do not appear in any document, but there is no analogue for removing documents.

Tags: r lda topic-modeling topicmodels

5 Answers

唯我独甜

#2 · 2019-01-13 02:00

agstudy's answer works great, but using it on a slow computer proved mildly problematic.

tic()
row_total = apply(dtm, 1, sum)
dtm.new = dtm[row_total>0,]
toc()
4.859 sec elapsed

(this was done with a 4000x15000 dtm)

The bottleneck appears to be applying sum() to a sparse matrix.

A document-term matrix created by the tm package contains the components i and j, which hold the row and column indices of the non-zero entries in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.

tic()
ui = unique(dtm$i)
dtm.new = dtm[ui,]
toc()
0.121 sec elapsed

ui contains the indices of all non-empty rows, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. The performance gain may not matter for small document-term matrices, but it can become significant with larger ones.
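As a rough check of the idea above, here is a minimal sketch on a toy corpus. The three documents are made up for illustration, and the slam::row_sums() line is an extra option that is not part of the original answer (slam is the sparse-matrix package that tm builds on). It only shows that the index-based selection picks the same rows as the apply()-based row totals.

```r
library(tm)

# Toy corpus (illustrative only); the second document is empty.
docs <- c("apples and oranges", "", "oranges and pears")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Row positions that hold at least one non-zero entry.
# sort() is a safeguard in case dtm$i is not already ordered.
ui <- sort(unique(dtm$i))
dtm.new <- dtm[ui, ]

# The slower dense route selects the same rows.
row_total <- apply(dtm, 1, sum)
same_as_apply <- all(ui == which(row_total > 0))

# Assumption beyond the answer: slam::row_sums() sums rows without
# converting the sparse matrix to a dense one.
same_as_slam <- all(ui == which(slam::row_sums(dtm) > 0))
```

Both same_as_apply and same_as_slam should come out TRUE here; on a real corpus it is worth running the same comparison once before relying on the faster route.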


看我几分像从前

#3 · 2019-01-13 02:08

"Each row of the input matrix needs to contain at least one non-zero entry"

The error means that the sparse matrix contains a row without any entries (words). One idea is to compute the sum of words by row:

rowTotals <- apply(dtm, 1, sum)  # find the sum of words in each document
dtm.new   <- dtm[rowTotals > 0, ] # remove all docs without words


Summer. ? 凉城

#4 · 2019-01-13 02:16

Just a small addendum to Dario Lacan's answer:

empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]

This collects the documents' ids rather than their row positions. Try this:

library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # returns "127", not "1"

If you construct your own corpus with consecutive numbering, some documents may be removed during data cleaning, which breaks the numbering. So it is better to use the id directly:

corpus <- tm_filter(
  corpus,
  FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
  # equivalent: !(meta(doc)$id %in% empty.rows)
)


兄弟一词,经得起流年.

#5 · 2019-01-13 02:20

This is just to elaborate on the answer given by agstudy.

Instead of removing the empty rows from the dtm, we can identify the documents in the corpus that end up with zero length and remove them directly from the corpus, then build a second dtm from only the non-empty documents.

This is useful to keep a 1:1 correspondence between the dtm and the corpus.

empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus <- corpus[-as.numeric(empty.rows)]
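A possible variation on this, sketched on a made-up toy corpus: select the documents to keep with a logical mask over row positions instead of negative indexing by id. This is an alternative to the snippet above, not the answerer's code. It assumes the dtm rows are in the same order as the corpus documents. It also sidesteps two pitfalls of negative indexing: when no document is empty, x[-numeric(0)] selects nothing at all, and document ids are not guaranteed to equal row positions.

```r
library(tm)

# Toy corpus (illustrative only); the second document is empty.
docs <- c("apples and oranges", "", "oranges and pears")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Logical mask by row position: TRUE for documents with at least one term.
rowTotals <- apply(dtm, 1, sum)
keep <- unname(rowTotals > 0)

# Drop the empty documents from the corpus, then rebuild the dtm
# so that corpus and dtm stay in 1:1 correspondence.
corpus.new <- corpus[keep]
dtm.new <- DocumentTermMatrix(corpus.new)
```

After this, nrow(dtm.new) and length(corpus.new) should match, and row k of dtm.new corresponds to corpus.new[[k]].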


来，给爷笑一个

#6 · 2019-01-13 02:24

Just remove the sparse terms from the DTM and all will work well.

dtm <- DocumentTermMatrix(crude, sparse=TRUE)
