How to read and write a TermDocumentMatrix in R?

Asked 2019-08-02 01:38

I made a word cloud from a CSV file in R, using the TermDocumentMatrix function from the tm package. Here is my code:

library(tm)
library(KoNLP)  # provides extractNoun()
# useSejongDic()  # optional: load the Sejong dictionary for KoNLP

csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(csvData$content) <- "UTF-8"

# extract nouns from each document (this is the slow step)
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = FALSE)

# create the corpus
myCorpus <- Corpus(VectorSource(nouns))

# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stop words (myStopwords is a character vector defined elsewhere)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# create the term-document matrix, keeping terms 2 to 5 characters long
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(2, 5)))

m <- as.matrix(TDM)

This process takes too long, and I believe extractNoun accounts for most of the time. To avoid recomputing it, I want to save the resulting TDM to a file. If I read that saved file back in, can I still call m <- as.matrix(...) on it as before? Or is there a better alternative?
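For reference, the generic way to persist a single R object is saveRDS()/readRDS(); a minimal sketch of what I have in mind (the file name tdm.rds is arbitrary):

# save the TDM once, after the expensive extractNoun step
saveRDS(TDM, file = "tdm.rds")

# in a later session: tm must be loaded so the as.matrix() method
# for TermDocumentMatrix objects is found
library(tm)
TDM <- readRDS("tdm.rds")
m <- as.matrix(TDM)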

2 Answers
Luminary・发光体 · answered 2019-08-02 02:34

I noticed that you call several tm functions, and these can also easily be parallelized. For the tm package, this functionality was updated in March 2017, a month after your question.

The new-features section of the release notes for tm version 0.7 (2017-03-02) states:

tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).

To set up parallelization for the tm commands, the following has worked for me:

library(tm)
library(parallel)
cores <- detectCores()     # use detectCores() - 1 if you want to do anything else on the PC
cl <- makeCluster(cores)
tm_parLapply_engine(cl)    # register the cluster as tm's parallelization engine
## insert your corpus creation, tm_map and
## TermDocumentMatrix commands here
tm_parLapply_engine(NULL)  # de-register the engine when done
stopCluster(cl)

If you are applying your own function through a tm_map content transformer, you will need clusterExport() to send that function to the worker processes before the tm_map(myCorpus, content_transformer(clean)) call. E.g., passing my clean function to the cluster:

clusterExport(cl, "clean") 
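Putting the pieces together, here is a minimal sketch; the clean() function is a hypothetical content transformer (it just lower-cases and trims whitespace) standing in for whatever custom cleaning you do:

library(tm)
library(parallel)

# hypothetical cleaning function, applied via content_transformer()
clean <- function(x) trimws(tolower(x))

cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "clean")   # export BEFORE the tm_map call that uses it
tm_parLapply_engine(cl)      # route tm's internal looping through the cluster

myCorpus <- tm_map(myCorpus, content_transformer(clean))
TDM <- TermDocumentMatrix(myCorpus)

tm_parLapply_engine(NULL)    # restore sequential processing
stopCluster(cl)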

One last comment: keep an eye on your memory usage. If your computer starts paging memory out to disk, the CPU is no longer the bottleneck and all the parallelization won't make a difference.

家丑人穷心不美 · answered 2019-08-02 02:37

I'm not an expert, but I have used NLP a few times.

I use parSapply from the parallel package. The documentation is here: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf

parallel ships with base R, and here is a toy usage example:

library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)

base <- 2                    # define the variable before exporting it
clusterExport(cl, "base")    # make 'base' visible on the workers

parSapply(cl, as.character(2:4),
          function(exponent){
            x <- as.numeric(exponent)
            c(base = base^x, self = x^x)
          })

stopCluster(cl)              # shut the workers down when finished

So, replace nouns <- sapply(csvData$content, extractNoun, USE.NAMES = FALSE) with a parSapply call and it will be faster :) A sketch of that substitution follows.
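A minimal sketch, assuming the KoNLP setup from the question; each worker needs KoNLP loaded so extractNoun() exists there:

library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(KoNLP))    # load KoNLP (extractNoun) on every worker
# clusterEvalQ(cl, useSejongDic())  # if the question's dictionary is needed

nouns <- parSapply(cl, csvData$content, extractNoun, USE.NAMES = FALSE)

stopCluster(cl)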
