My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix such that I can pass it to a bayesian classifier.
Consider the following code:
library(tm)
GetCorpus <-function(textVector)
{
doc.corpus <- Corpus(VectorSource(textVector))
doc.corpus <- tm_map(doc.corpus, tolower)
doc.corpus <- tm_map(doc.corpus, removeNumbers)
doc.corpus <- tm_map(doc.corpus, removePunctuation)
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
return(doc.corpus)
}
data <- data.frame(
c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)
corp <- GetCorpus(data[,1])
inspect(corp)
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
The output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
Terms
Docs bar big child dogs holds honor hunt let stud
character(0) 0 1 0 1 0 0 1 1 0
character(0) 1 0 0 0 1 0 0 0 0
character(0) 0 0 1 0 0 1 0 0 1
My question is, what can I use to create a corpus and DTM faster? It seems to be extremely slow if I use over 300k rows.
I have heard that I could use data.table
but I am not sure how.
I have also looked at the qdap
package, but it gives me an error when trying to load the package, plus I don't even know if it will work.
Ref. http://cran.r-project.org/web/packages/qdap/qdap.pdf
I think you may want to consider a more regex focused solution. These are some of the problems/thinking I'm wrestling with as a developer. I'm currently looking at the stringi
package heavily for development as it has some consistently named functions that are wicked fast for string manipulation.
In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm
may give us (and certainly much faster than qdap
). Here I haven't even explored parallel processing or data.table/dplyr and instead focus on string manipulation with stringi
and keeping the data in a matrix and manipulating with specific packages meant to handle that format. I take your example and multiply it 100000x. Even with stemming, this takes 17 seconds on my machine.
data <- data.frame(
text=c("Let the big dogs hunt",
"No holds barred",
"My child is an honor student"
), stringsAsFactors = F)
## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop=FALSE]
library(stringi)
library(SnowballC)
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english"))) #in old package versions it was named 'stri_extract_words'
names(out) <- paste0("doc", 1:length(out))
lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)
library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]
library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)
tdm <- tm::as.TermDocumentMatrix(dat2, weighting=weightTf)
tdm
## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting=weightTf)
dtm
Which approach?
data.table
is definitely the right way to go. Regex operations are slow, although the ones in stringi
are much faster (in addition to being much better). Anything with
I went through many iterations of solving problem in creating quanteda::dfm()
for my quanteda package (see the GitHub repo here). The fastest solution, by far, involves using the data.table
and Matrix
packages to index the documents and tokenised features, counting the features within documents, and plugging the result straight into a sparse matrix.
In the code below, I've taken for an example texts found with the quanteda package, which you can (and should!) install from CRAN or the development version from
devtools::install_github("kbenoit/quanteda")
I'd be very interested to see how it works on your 4m documents. Based on my experience working with corpuses of that size, it will work pretty well (if you have enough memory).
Note that in all my profiling, I could not improve the speed of the data.table operations through any sort of parallelisation, because of the way they are written in C++.
Core of the quanteda dfm()
function
Here is the bare bones of the data.table
based source code, in case any one wants to have a go at improving it. It takes a input a list of character vectors representing the tokenized texts. In the quanteda package, the full-featured dfm()
works directly on character vectors of documents, or corpus objects, directly and implements lowercasing, removal of numbers, and removal of spacing by default (but these can all be modified if wished).
require(data.table)
require(Matrix)
dfm_quanteda <- function(x) {
docIndex <- 1:length(x)
if (is.null(names(x)))
names(docIndex) <- factor(paste("text", 1:length(x), sep="")) else
names(docIndex) <- names(x)
alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
features = unlist(x, use.names = FALSE))
alltokens <- alltokens[features != ""] # if there are any "blank" features
alltokens[, "n":=1L]
alltokens <- alltokens[, by=list(docIndex,features), sum(n)]
uniqueFeatures <- unique(alltokens$features)
uniqueFeatures <- sort(uniqueFeatures)
featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
features = uniqueFeatures)
setkey(alltokens, features)
setkey(featureTable, features)
alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]
sparseMatrix(i = alltokens$docIndex,
j = alltokens$featureIndex,
x = alltokens$V1,
dimnames=list(docs=names(docIndex), features=uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
## user system elapsed
## 0.060 0.005 0.064
That's just a snippet of course but the full source code is easily found on the GitHub repo (dfm-main.R
).
quanteda on your example
How's this for simplicity?
require(quanteda)
mytext <- c("Let the big dogs hunt",
"No holds barred",
"My child is an honor student")
dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing 3 documents
# ... shaping tokens into data.table, found 14 total tokens
# ... stemming the tokens (english)
# ... ignoring 174 feature types, discarding 5 total features (35.7%)
# ... summing tokens by document
# ... indexing 9 feature types
# ... building sparse matrix
# ... created a 3 x 9 sparse dfm
# ... complete. Elapsed time: 0.023 seconds.
# Document-feature matrix of: 3 documents, 9 features.
# 3 x 9 sparse Matrix of class "dfmSparse"
# features
# docs bar big child dog hold honor hunt let student
# text1 0 1 0 1 0 0 1 1 0
# text2 1 0 0 0 1 0 0 0 0
# text3 0 0 1 0 0 1 0 0 1
You have a few choices. @TylerRinker commented about qdap
, which is certainly a way to go.
Alternatively (or additionally) you could also benefit from a healthy does of parallelism. There's a nice CRAN page detailing HPC resources in R. It's a bit dated though and the multicore
package's functionality is now contained within parallel
.
You can scale up your text mining using the multicore apply
functions of the parallel
package or with cluster computing (also supported by that package, as well as by snowfall
and biopara
).
Another way to go is to employ a MapReduce
approach. A nice presentation on combining tm
and MapReduce
for big data is available here. While that presentation is a few years old, all of the information is still current, valid and relevant. The same authors have a newer academic article on the topic, which focuses on the tm.plugin.dc
plugin. To get around having a Vector Source instead of DirSource
you can use coercion:
data("crude")
as.DistributedCorpus(crude)
If none of those solutions fit your taste, or if you're just feeling adventurous, you might also see how well your GPU can tackle the problem. There's a lot of variation in how well GPUs perform relative to CPUs and this may be a use case. If you'd like to give it a try, you can use gputools
or the other GPU packages mentioned on the CRAN HPC Task View.
Example:
library(tm)
install.packages("tm.plugin.dc")
library(tm.plugin.dc)
GetDCorpus <-function(textVector)
{
doc.corpus <- as.DistributedCorpus(VCorpus(VectorSource(textVector)))
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, content_transformer(removeNumbers))
doc.corpus <- tm_map(doc.corpus, content_transformer(removePunctuation))
# <- tm_map(doc.corpus, removeWords, stopwords("english")) # won't accept this for some reason...
return(doc.corpus)
}
data <- data.frame(
c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)
dcorp <- GetDCorpus(data[,1])
tdm <- TermDocumentMatrix(dcorp)
inspect(tdm)
Output:
> inspect(tdm)
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 10/20
Sparsity : 67%
Maximal term length: 7
Weighting : term frequency (tf)
Docs
Terms 1 2 3
barred 0 1 0
big 1 0 0
child 0 0 1
dogs 1 0 0
holds 0 1 0
honor 0 0 1
hunt 1 0 0
let 1 0 0
student 0 0 1
the 1 0 0