Inconsistent behaviour with tm_map transformation

2020-03-08 09:37发布

问题:

Another potential title for this post could be "When parallel processing in r, does the ratio between number of cores, loop chunk size and object size matter?"

I have a corpus I am running some transformations on using tm package. Since the corpus is large I'm using parallel processing with doparallel package.

Sometimes the transformations do the task, but sometimes they do not. For example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417". So if preprocessing is successful then this doc will be transformed to just "n".

Sample corpus is below for reproduction. Here is the code block:

library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)
library(SnowballC)

  corpus <- (see below)
  n <- 100 # this is the size of each chunk in the loop

  # split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
  lenp <- length(pieces)

  rm(corpus) # save memory

  # save pieces to rds files since not enough RAM
  tmpfile <- tempfile() 
  for (i in seq_len(lenp)) {
    saveRDS(pieces[[i]],
            paste0(tmpfile, i, ".rds"))
  }

  rm(pieces) # save memory

  # doparallel
  registerDoParallel(cores = 12)
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # regular transformations        
    piece <- tm_map(piece, content_transformer(removePunctuation), preserve_intra_word_dashes = T)
    piece <- tm_map(piece, content_transformer(function(x, ...) 
      qdap::rm_stopwords(x, stopwords = tm::stopwords("english"), separate = F)))
    piece <- tm_map(piece, removeNumbers)
    saveRDS(piece, paste0(tmpfile, i, ".rds"))
    return(1) # hack to get dopar to forget the piece to save memory since now saved to rds
  } 

  stopImplicitCluster()

  # combine the pieces back into one corpus
  corpus <- list()
  corpus <- foreach(i = seq_len(lenp)) %do% {
    corpus[[i]] <- readRDS(paste0(tmpfile, i, ".rds"))
  }
  corpus_done <- do.call(function(...) c(..., recursive = TRUE), corpus)

And here is a link to sample data. I need to paste a sufficiently large sample of 2k docs to recreate and SO won't let me paste that much so please see linked doc for data.

corpus <- VCorpus(VectorSource([paste the chr vector from link above]))

If I run my code block as above with n = to 200 then look at the results I can see that numbers remain where they should have been removed by tm::removeNumbers()

> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n417"
[1] "disturbance"
[1] "grand theft auto"

However if I change the chunk size (the value of "n" variable) to 100:

> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n"
[1] "disturbance"
[1] "grand theft auto"

The numbers have been removed.

But, this is inconsistent. I tried to narrow it down by testing on 150, then 125 ... and found that it would/would not work between 120 and 125 chunk size. Then after iterating the function between 120:125 it would sometimes work and then not for the same chunk size.

I think maybe there's a relationship to this issue between 3 variables: the size of the corpus, the chunk size and the number of cores in registerdoparallel(). I just don't know what it is.

Can anyone lend a hand or even reproduce with the linked sample corpus? I'm concerned since I can reproduce the error sometimes, other times I cannot. Changing the chunk size gives a kind of ability to see the error with remove numbers, but not always.


Update Today I resumed my session and could not replicate the error. I created a Google Doc and experimented with differing values for corpus size, number of cores and chunk sizes. In each case everything was a success. So, I tried running on full data and everything worked. However, for my sanity I tried running again on full data and it failed. Now, I'm back to where I was yesterday. It appears as though have run the function on a larger dataset has changed something ... I don't know what. Perhaps a session variable of some sort? So, the new information is that this bug only happens after running the function on a very large dataset. Restarting my session did not solve the problem but resuming the sessions after being away for several hours did.


New information. It might be easier to reproduce the issue on a larger corpus since this is what seems to trigger the issue corpus <- do.call(c, replicate(250, corpus, simplify = F)) will create a 500k docs corpus based on the sample I provided. The function may work the first time you call it but for me it seems to fail the second time.

This issue is hard because if I could reproduce the problem I would likely be able to identify and fix it.


New information. Because there's several things happening with this function it was hard to know where to focus debugging efforts. I was looking at both the fact I'm using multiple temp RDS files to save memory and also the fact that I'm doing parallel processing. I wrote two alternative versions of the script, one that still uses the rds files and breaks the corpus up but does not do parallel processing (replaced %dopar% with just %do% and also removed registerDoParallel line) and one that uses parallel processing but does not use RDS temp files to split the small sample corpus up. I was not able to produce the error with the single core version of the script, only with the version that uses %dopar% was I able to recreate the issue (though the issue is intermittent, it does not always fail with dopar). So, this issue only appears when using %dopar%. The fact I'm using temp RDS files does not appear to be part of the problem.