I made a wordcloud from a CSV file in R, using the TermDocumentMatrix function from the tm package. Here is my code:
library(tm)
library(KoNLP)

csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(csvData$content) <- "UTF-8"
# useSejongDic() - from the KoNLP package
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = FALSE)
# create corpus
myCorpus <- Corpus(VectorSource(nouns))
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stopwords (myStopwords is defined elsewhere)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# create matrix
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(2, 5)))
m <- as.matrix(TDM)
This process seemed to take too much time, and I think extractNoun is what accounts for most of it. To make the code more time-efficient, I want to save the resulting TDM to a file. When I read the saved file back in, can I just use m <- as.matrix(saved TDM file)? Or is there a better alternative?
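Concretely, I mean something like this, using base R's saveRDS()/readRDS() ("tdm.rds" is just a placeholder file name):

saveRDS(TDM, "tdm.rds")    # serialize the TermDocumentMatrix object to disk
TDM <- readRDS("tdm.rds")  # restore it in a later session
m <- as.matrix(TDM)        # same dense matrix as before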
I noticed that you have calls to several tm commands, which can also easily be parallelized. For the tm package this functionality was added in March 2017, a month after your question.
In the new features section of the release notes for tm version 0.7 (2017-03-02), it is indicated that tm_parLapply() is now used internally to parallelize transformations, filters, and term-document matrix construction, and that the preferred parallelization backend can be registered via tm_parLapply_engine().
To set up parallelization for the tm commands, something along the following lines has worked for me (a sketch against the tm 0.7 API; the cluster size is my own choice):
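library(tm)
library(parallel)

cl <- makeCluster(detectCores() - 1)  # leave one core free for the OS
tm_parLapply_engine(cl)               # register the cluster as tm's parallel backend
# ... create the corpus, run the tm_map() transformations,
# ... and build the TermDocumentMatrix here ...
tm_parLapply_engine(NULL)             # switch tm back to sequential processing
stopCluster(cl)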
If you have a function that you are applying through a tm_map content transformer, you will need to use clusterExport to pass that function to the parallelized environment before the tm_map(MyCorpus, content_transformer(clean)) command, e.g. passing my clean function to the environment, as sketched below.
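A minimal sketch (clean stands in for a hypothetical user-defined cleaning function, and cl is the cluster created above):

clean <- function(x) tolower(x)                           # hypothetical cleaning function
clusterExport(cl, "clean")                                # ship clean() to the workers
MyCorpus <- tm_map(MyCorpus, content_transformer(clean))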
One last comment: keep an eye on your memory usage. If your computer starts paging memory out to disk, the CPU is no longer the critical path, and all the parallelization won't make a difference.
I'm not an expert, but I've used NLP sometimes. I use parSapply from the parallel package. Here's the documentation: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
parallel comes with base R, and a silly usage example is sketched below. So, parallelize
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F)
and it will be faster :)
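A minimal sketch of that replacement (assuming KoNLP is installed; the cluster size is my own choice):

library(parallel)
library(KoNLP)

cl <- makeCluster(detectCores() - 1)  # leave one core free
clusterEvalQ(cl, library(KoNLP))      # load KoNLP on every worker so extractNoun is available
nouns <- parSapply(cl, csvData$content, extractNoun, USE.NAMES = FALSE)
stopCluster(cl)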