So, I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents, each ~1000 words). The process runs and then dies, I think because it is running out of memory.
So I am trying to shrink the document term matrix that the LDA() function takes as input; I figure I can do that using the minDocFreq option when I generate my document term matrices. But when I use it, it doesn't seem to make any difference. Here is the relevant bit of code:
> corpus <- Corpus(DirSource('./chunks/'),fileEncoding='utf-8')
> dtm <- DocumentTermMatrix(corpus)
> dim(dtm)
[1] 6423 41613
# So, I assume this next command will make my document term matrix smaller, i.e.
# fewer columns. I've chosen a larger number, 100, to illustrate the point.
> smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
> dim(smaller)
[1] 6423 41613
Same dimensions, and same number of columns (that is, same number of terms).
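To be explicit about the behaviour I expected, here is a toy sketch in base R only. It does not use tm objects, and the matrix, the names (toy, doc_freq, min_docs, filtered) and the threshold are all made up for illustration. My assumption is that a minimum document frequency filter should drop every term that appears in fewer than the given number of documents:

```r
# Toy document-term matrix: 4 documents (rows) x 3 terms (columns).
# All names and counts are hypothetical, for illustration only.
toy <- matrix(c(1, 0, 2,
                3, 0, 1,
                1, 1, 0,
                2, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("alpha", "beta", "gamma")))

# Document frequency: in how many documents does each term occur at least once?
doc_freq <- colSums(toy > 0)

# Keep only the terms that occur in at least min_docs documents.
min_docs <- 2
filtered <- toy[, doc_freq >= min_docs, drop = FALSE]

dim(toy)       # 4 rows, 3 columns
dim(filtered)  # 4 rows, 2 columns: "beta" occurs in only one document
```

That is, I expected the second call to DocumentTermMatrix to return fewer columns in the same way, not an identical matrix.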
Any sense what I'm doing wrong? Thanks.