Extremely slow R code and hanging

Posted 2019-09-08 16:40

Question:

Calling the read.table() function (on a CSV file), as follows:

  download.file(url, destfile = file, mode = "w")
  conn <- gzcon(bzfile(file, open = "r"))
  try(fileData <- read.table(conn, sep = ",", row.names = NULL), silent = FALSE)

produces the following error:

Error in pushBack(c(lines, lines), file) : 
  can only push back on text-mode connections

I tried to “wrap” the connection explicitly with tConn <- textConnection(readLines(conn)) (and then, of course, passed tConn instead of conn to read.table()), but this made the code extremely slow and eventually hung the R processes (I had to restart R).

UPDATE (This shows again how useful it is to try to explain your problems to other people!):

As I was writing this, I decided to go back to the documentation and read about gzcon() again. I had thought that it not only decompresses the bzip2 file but also “labels” it as text. Then I realized that this is a ridiculous assumption: I know that there is a text (CSV) file inside the bzip2 archive, but R doesn’t. Therefore, my initial attempt to use textConnection() was the right approach, but something is causing a problem. If - and it’s a big IF - my logic is correct up to this point, the next question is whether the problem is due to textConnection() or readLines().
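A note on the connection modes involved, offered as a minimal sketch rather than a definitive diagnosis. According to the R documentation, gzcon() expects gzip-style streams and always returns a binary-mode connection, which would fit the pushBack() error above, while bzfile() decompresses bzip2 data by itself and is a text-mode connection when opened with "r". The tiny comma-separated file below is made up for illustration, and tmp, out and info are hypothetical names.

```r
# Build a small bzip2-compressed, comma-separated file to experiment with.
tmp <- tempfile(fileext = ".txt.bz2")
out <- bzfile(tmp, open = "w")
writeLines(c("id,name", "1,alpha", "2,beta"), out)
close(out)

# Open the archive directly with bzfile(); no gzcon() or textConnection() wrapper.
conn <- bzfile(tmp, open = "r")

# summary() on a connection reports, among other fields, whether it is text or binary.
info <- summary(conn)
print(info$text)

# read.table() can read from the text-mode connection as it is.
fileData <- read.table(conn, sep = ",", header = TRUE, stringsAsFactors = FALSE)
close(conn)
unlink(tmp)
```

Reading straight from the bzfile() connection also avoids holding the whole file in memory twice, as textConnection(readLines(conn)) does.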

Please advise. Thank you!

P.S. The CSV files that I'm trying to read are in an "almost" CSV format, so I can't use standard R functions for CSV processing.

===

UPDATE 1 (Program Output):

===

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectAuthors2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 514960 bytes (502 Kb)
opened URL
==================================================
downloaded 502 Kb

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDependencies2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 133295 bytes (130 Kb)
opened URL
==================================================
downloaded 130 Kb

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDescriptions2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 5404286 bytes (5.2 Mb)
opened URL
==================================================
downloaded 5.2 Mb

===

UPDATE 2 (Program output):

===

After a very long time, I get the following message, and then the program continues processing the rest of the files:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 8 elements

Then the situation repeats: after processing several smaller (less than 1MB) files, the program "freezes" while processing a larger (> 1MB) file:

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectTags2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 1226391 bytes (1.2 Mb)
opened URL
==================================================
downloaded 1.2 Mb

===

UPDATE 3 (Program output):

===

After giving the program more time to run, I discovered the following:

*) My assumption that a file size of ~1MB plays a role in the weird behavior was wrong: the program successfully processed files larger than 1MB and failed on files smaller than 1MB. Here is an example of output with errors:

trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 826288 bytes (806 Kb)
opened URL
==================================================
downloaded 806 Kb

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 4 elements
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

An example with errors when processing a very small file:

trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 3092 bytes
opened URL
==================================================
downloaded 3092 bytes

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 2 elements

From the above examples, it is clear that size is not the factor, but file structure might be.

*) I misreported the maximum file size: it is 54.2MB compressed. Processing this file does not just generate error messages and continue; it triggers an unrecoverable error and stops (exits):

trying URL 'http://flossdata.syr.edu/data/gc/2012/2012-Nov/gcProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 56793796 bytes (54.2 Mb)
opened URL
=================================================
downloaded 54.2 Mb

Error in textConnection(readLines(conn)) : 
  cannot allocate memory for text connection

*) After the emergency exit, five R processes use 51% of memory each, while after a manual R restart this number stays at 7% (according to htop).

Even considering the possibility of a "very bad" text/CSV format (suggested by the "Error in scan()" messages), the behavior of the standard R functions textConnection() and/or readLines() looks very strange to me, even "suspicious". My understanding is that a good function should handle erroneous input gracefully: allow a very limited time or number of retries, then continue processing if possible, or exit when further processing is impossible. In this case we see (via the defect ticket screenshot) that the R process is taxing both the memory and the processor of the virtual machine.

Answer 1:

When this has happened to me in the past, I got better performance by not using 'textConnection'. Instead, if I have to do some preprocessing with 'readLines', I then write the data to a temporary file and use that file as input to 'read.table'.
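The temporary-file approach described above might look roughly like the following sketch. The sample rows, the "drop empty lines" preprocessing step, and the names src, cleaned and lines are all hypothetical stand-ins; a real script would use the downloaded archive and its own cleanup logic.

```r
# Stand-in for a downloaded archive: a tiny bzip2 file with tab-separated rows.
src <- tempfile(fileext = ".txt.bz2")
out <- bzfile(src, open = "w")
writeLines(c("proj_num\tproj_unixname", "", "14\tA2ps", "99\tAcct"), out)
close(out)

# Read the raw lines, then close the connection right away.
conn <- bzfile(src, open = "r")
lines <- readLines(conn)
close(conn)

# Hypothetical preprocessing step: drop empty lines.
lines <- lines[nzchar(lines)]

# Write the cleaned lines to a temporary file instead of a textConnection().
cleaned <- tempfile(fileext = ".txt")
writeLines(lines, cleaned)

# Parse the temporary file, then remove both files.
DF <- read.table(cleaned, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
unlink(c(src, cleaned))
```

This trades a little disk I/O for not keeping the character vector and a text connection built from it in memory at the same time.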



Answer 2:

You don't have CSV files. I only looked at one of them (yes, I actually opened it in a text editor), but they seem to be tab-delimited.

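url <- 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
file <- "temp.txt.bz2"
# The archive is binary, so download it in binary mode ("wb"); "w" can corrupt it on Windows.
download.file(url, destfile = file, mode = "wb")
# bzfile() decompresses by itself; opened with "r" it is a text-mode connection.
dat <- bzfile(file, open = "r")
# The fields are separated by tabs, not commas.
DF <- read.table(dat, header=TRUE, sep="\t")
close(dat)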

head(DF)
#   proj_num proj_unixname               requirement       requirement_type      date_collected datasource_id
# 1       14          A2ps                    E-mail           Help,Support 2012-11-02 10:57:40           346
# 2       99          Acct                    E-mail           Bug Tracking 2012-11-02 10:57:40           346
# 3      128          Adns    VCS Repository Webview              Developer 2012-11-02 10:57:40           346
# 4      128          Adns                    E-mail                   Help 2012-11-02 10:57:40           346
# 5      196        AmaroK    VCS Repository Webview           Bug Tracking 2012-11-02 10:57:40           346
# 6      196        AmaroK Mailing List Info/Archive Bug Tracking,Developer 2012-11-02 10:57:40           346
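
If other files in the set still fail with "line N did not have M elements" or "EOF within quoted string", one possible check is to count the fields per line with quoting disabled, since by default a stray apostrophe or double quote in free text is read as the start of a quoted string. This is only a sketch: the sample rows and the names sample_file, n_fields and bad_lines are invented for illustration, and whether quoting is the real cause for these particular files is an assumption.

```r
# Made-up tab-delimited sample containing an apostrophe in one field.
sample_file <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tnote",
             "1\talpha\tplain text",
             "2\tO'Brien\thas an apostrophe",
             "3\tgamma\tshort"), sample_file)

# Count tab-separated fields per line, treating quote and comment
# characters as ordinary text.
n_fields <- count.fields(sample_file, sep = "\t", quote = "", comment.char = "")

# Lines whose field count differs from the header line are worth inspecting.
bad_lines <- which(n_fields != n_fields[1])

# Use the same settings when parsing.
DF <- read.table(sample_file, header = TRUE, sep = "\t", quote = "",
                 comment.char = "", stringsAsFactors = FALSE)
unlink(sample_file)
```

If bad_lines is non-empty for a real file, those line numbers point at rows with embedded tabs or line breaks that need cleaning before read.table() can parse them.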