Let's do some Text Mining
Here I stand with a term-document matrix (from the tm package):
dtm <- TermDocumentMatrix(
  myCorpus,
  control = list(
    weighting = weightTfIdf,  # the control option is named "weighting", not "weight"
    tolower = TRUE,
    removeNumbers = TRUE,
    minWordLength = 2,
    removePunctuation = TRUE,
    stopwords = stopwords("german")
  )
)
When I do a
typeof(dtm)
I see that it is a "list" and the structure looks like
Docs
Terms 1 2 ...
lorem 0 0 ...
ipsum 0 0 ...
... .......
So I try a
wordMatrix = as.data.frame( t(as.matrix( dtm )) )
That works for 1000 documents.
But when I try it with 40000, it no longer works.
I get this error:
Fehler in vector(typeof(x$v), nr * nc) : Vektorgröße kann nicht NA sein
Zusätzlich: Warnmeldung:
In nr * nc : NAs durch Ganzzahlüberlauf erzeugt
In English:
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
So I looked at as.matrix, and it turns out that the function first converts the object to a vector with as.vector and then to a matrix. The conversion to a vector works, but the one from the vector to the matrix doesn't.
Do you have any suggestions what could be the problem?
Thanks, The Captain
The integer overflow warning tells you exactly what the problem is: with 40000 documents, you have too much data. The problem begins in the conversion to a matrix, which you can see if you look at the code of the underlying as.matrix method.
The call vector(typeof(x$v), nr * nc) in it is the line referenced by the error message. What is going on is easy to simulate: multiply two integers whose product exceeds the integer range and R returns NA with the same overflow warning.
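A minimal sketch of that simulation in base R. The term count of 60000 is a made-up figure for illustration; only the 40000 documents come from the question.

```r
# Dimensions are stored as integers; 60000 terms is an assumed figure.
nr <- 40000L
nc <- 60000L

# integer * integer stays integer, so a product above .Machine$integer.max
# overflows to NA (R warns "NAs produced by integer overflow").
n_int <- suppressWarnings(nr * nc)
is.na(n_int)

# The same product computed as a double is representable.
n_dbl <- as.numeric(nr) * nc
n_dbl > .Machine$integer.max

# Rough size of the dense result at 8 bytes per double cell, in GiB.
n_dbl * 8 / 1024^3
```

Under these assumed dimensions the dense matrix alone would need close to 18 GiB, so the overflow is only the first obstacle.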
The function vector() takes the length as an argument, in this case nr*nc. If this is larger than approx. 2e9 (.Machine$integer.max), the integer product overflows and is replaced by NA. This NA is not valid as an argument for vector().
Bottom line: you're running into the limits of R. As for now, working in 64bit won't help you. You'll have to resort to different methods. One possibility would be to continue working with the list you have (dtm is a list), selecting the data you need using list manipulation, and go from there.
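As a rough illustration of that last suggestion, the sketch below selects a subset of documents directly from the sparse representation and only then builds a dense matrix. It assumes the object stores its non-zero cells in the fields i (term index), j (document index) and v (value), next to nrow, ncol and dimnames, which is the layout the x$v in the error message hints at. The tiny list `tdm` and the helper `subset_docs` are hypothetical stand-ins, not part of tm.

```r
# Hypothetical stand-in for a term-document matrix in triplet form:
# 3 terms x 4 documents, only non-zero cells are stored.
tdm <- list(
  i = c(1L, 2L, 3L, 1L),          # row (term) index of each non-zero cell
  j = c(1L, 1L, 3L, 4L),          # column (document) index
  v = c(0.5, 1.2, 0.7, 2.0),      # cell value
  nrow = 3L, ncol = 4L,
  dimnames = list(Terms = c("lorem", "ipsum", "dolor"),
                  Docs = as.character(1:4))
)

# Keep only the requested documents and return them as a small dense matrix.
subset_docs <- function(x, docs) {
  keep <- x$j %in% docs
  m <- matrix(0, nrow = x$nrow, ncol = length(docs),
              dimnames = list(x$dimnames[[1]], x$dimnames[[2]][docs]))
  # Map the original document indices to column positions in the subset.
  m[cbind(x$i[keep], match(x$j[keep], docs))] <- x$v[keep]
  m
}

small <- subset_docs(tdm, docs = c(1L, 4L))
small
```

Calling such a helper on one chunk of documents at a time keeps each dense piece far below the integer limit, at the cost of never holding the full matrix in memory at once.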
PS : I made a dtm object by
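Based on Joris Meys' answer, I've found the solution. The vector() documentation says of the "length" argument that a length greater than .Machine$integer.max has to be supplied as type "double".
So a tiny fix to as.matrix() is enough: compute the length as a double instead of as an integer product.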
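A sketch of what such a fix could look like. It assumes the object keeps its non-zero cells in the fields i, j and v alongside nrow, ncol and dimnames; the name `as_dense_matrix` is made up here, and the body is modeled on the vector(typeof(x$v), nr * nc) call quoted in the error message rather than copied from any package. It only removes the overflow: it still needs an R version with long-vector support (3.0.0 or later) and enough RAM for the dense result.

```r
as_dense_matrix <- function(x) {
  nr <- x$nrow
  nc <- x$ncol
  # Convert before multiplying so the length is a double, not an
  # overflowing integer product.
  len <- as.numeric(nr) * as.numeric(nc)
  y <- matrix(vector(typeof(x$v), len), nr, nc)
  # Fill in the non-zero cells from the triplet representation.
  y[cbind(x$i, x$j)] <- x$v
  dimnames(y) <- x$dimnames
  y
}

# Tiny hypothetical triplet object to exercise the function.
tdm <- list(i = c(1L, 2L), j = c(1L, 2L), v = c(0.5, 1.5),
            nrow = 2L, ncol = 2L,
            dimnames = list(Terms = c("lorem", "ipsum"), Docs = c("1", "2")))
m <- as_dense_matrix(tdm)
m
```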
Here is a very very simple solution I discovered recently
Please note that taking the transpose of the TDM to get a DTM is entirely optional; it is just my personal preference to work with the matrices this way.
P.S. I could not answer the question 4 years ago, as I had only just started college.