I made a word cloud from a CSV file in R, using the TermDocumentMatrix function from the tm package. Here is my code:
library(tm)
library(KoNLP)    # provides extractNoun
# useSejongDic()  # KoNLP: load the Sejong dictionary if needed

# read the CSV file as UTF-8 text
csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(csvData$content) <- "UTF-8"

# extract nouns from each document (KoNLP)
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = FALSE)

# create the corpus
myCorpus <- Corpus(VectorSource(nouns))
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stop words (myStopwords is a character vector defined elsewhere)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# create the term-document matrix, keeping terms 2 to 5 characters long
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(2, 5)))
m <- as.matrix(TDM)
This process seemed to take too much time; I think extractNoun accounts for most of it. To make the code more time-efficient, I want to save the resulting TDM as a file. When I read that saved file back, can I simply use m <- as.matrix(saved TDM file)? Or is there a better alternative?
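In other words, I am thinking of caching along these lines (a sketch using base R's saveRDS()/readRDS(); the file name TDM.rds is just a placeholder):

# save the term-document matrix once the slow pipeline has run
saveRDS(TDM, "TDM.rds")

# in a later session, skip extractNoun entirely and reload it
TDM <- readRDS("TDM.rds")
m <- as.matrix(TDM)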
I'm not an expert, but I've worked with NLP occasionally.
I use parSapply from the parallel package. Here's the documentation: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
parallel ships with base R, and here is a silly usage example:
library(parallel)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)

base <- 2
clusterExport(cl, "base")  # base must exist before it can be exported

parSapply(cl, as.character(2:4),
          function(exponent) {
            x <- as.numeric(exponent)
            c(base = base^x, self = x^x)
          })

stopCluster(cl)
So, parallelize nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F)
and it will be faster :)
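Applied to your case, it could look something like this (a sketch, assuming KoNLP is installed; each worker needs KoNLP loaded so that extractNoun is found):

library(parallel)
library(KoNLP)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)

# load KoNLP on every worker so extractNoun is available there
clusterEvalQ(cl, library(KoNLP))

# same call as before, but distributed across the cluster
nouns <- parSapply(cl, csvData$content, extractNoun, USE.NAMES = FALSE)

stopCluster(cl)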
I noticed that you have calls to several tm commands which can also easily be parallelized. For the tm package this functionality was updated in March 2017, a month after your question.
In the new features section of the release notes for tm version 0.7 (2017-03-02) it says:
tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).
To set up parallelization for the tm commands the following has worked for me:
library(parallel)
cores <- detectCores()
cl <- makeCluster(cores)  # use cores - 1 if you want to do anything else on the PC
tm_parLapply_engine(cl)

## insert your commands for creating the corpus,
## the tm_map transformations and TermDocumentMatrix here
tm_parLapply_engine(NULL)
stopCluster(cl)
If you have a function that you are applying through a tm_map content transformer, you will need to use clusterExport to pass that function to the parallelized environment before the tm_map(MyCorpus, content_transformer(clean)) command. E.g., passing my clean function to the environment:
clusterExport(cl, "clean")
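Put together, that part could look like this (a sketch; the body of clean is a made-up placeholder for whatever cleaning you actually do):

# hypothetical content transformer; replace the body with your own cleaning
clean <- function(x) {
  x <- tolower(x)
  gsub("\\s+", " ", x)  # collapse runs of whitespace
}

clusterExport(cl, "clean")  # make clean() visible on the workers
MyCorpus <- tm_map(MyCorpus, content_transformer(clean))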
One last comment: keep an eye on your memory usage. If your computer starts paging memory out to disk, the CPU is no longer the critical path, and all the parallelization won't make a difference.