R could not allocate memory on ff procedure. How c

Posted 2019-02-20 00:23

Question:

I'm working on a 64-bit Windows Server 2008 machine with Intel Xeon processor and 24 GB of RAM. I'm having trouble trying to read a particular TSV (tab-delimited) file of 11 GB (>24 million rows, 20 columns). My usual companion, read.table, has failed me. I'm currently trying the package ff, through this procedure:

> df <- read.delim.ffdf(file       = "data.tsv",
+                       header     = TRUE,
+                       VERBOSE    = TRUE,
+                       first.rows = 1e3,
+                       next.rows  = 1e6,
+                       na.strings = c("", NA),
+                       colClasses = c("NUMERO_PROCESSO" = "factor"))

This works fine for about 6 million records, but then I get an error, as you can see:

read.table.ffdf 1..1000 (1000) csv-read=0.14sec ffdf-write=0.2sec
read.table.ffdf 1001..1001000 (1000000) csv-read=240.92sec ffdf-write=67.32sec
read.table.ffdf 1001001..2001000 (1000000) csv-read=179.15sec ffdf-write=94.13sec
read.table.ffdf 2001001..3001000 (1000000) csv-read=792.36sec ffdf-write=68.89sec
read.table.ffdf 3001001..4001000 (1000000) csv-read=192.57sec ffdf-write=83.26sec
read.table.ffdf 4001001..5001000 (1000000) csv-read=187.23sec ffdf-write=78.45sec
read.table.ffdf 5001001..6001000 (1000000) csv-read=193.91sec ffdf-write=94.01sec
read.table.ffdf 6001001..
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

If I'm not mistaken, R is complaining of lack of memory to read the data, but wasn't the read...ffdf procedure supposed to circumvent heavy memory usage when reading data? What could I be doing wrong here?

Answer 1:

(I realize this is an old question, but I had the same problem and spent two days looking for the solution. This seems as good a place as any to document what I eventually figured out for posterity.)

The problem isn't that you are running out of available memory. The problem is that you've hit the memory limit for a single string. From help('Memory-limits'):

There are also limits on individual objects. The storage space cannot exceed the address limit, and if you try to exceed that limit, the error message begins cannot allocate vector of length. The number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9, which is also the limit on each dimension of an array.
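As a rough sanity check, the 2048 Mb in the error message lines up with that per-string limit. The sketch below only does the arithmetic; it assumes that "Mb" in the error message means 1024^2 bytes, which the message itself does not state.

```r
# Maximum number of bytes in one character string, per help('Memory-limits').
string_limit <- 2^31 - 1

# Size quoted in the error message, ASSUMING "Mb" means 1024^2 bytes.
reported_bytes <- 2048 * 1024^2

# Difference between the two figures.
reported_bytes - string_limit

# The same bound is also R's largest integer value.
string_limit == .Machine$integer.max
```

If the assumption holds, the failed allocation is a string buffer sitting right at the limit, not a sign that the machine's 24 GB of RAM ran out.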

In my case (and, it appears, yours as well), I didn't set the quote character because I was dealing with tab-separated data and assumed it didn't matter. However, read.delim still treats the double quote as a quoting character by default. Somewhere in the middle of the data set there was a string with an unmatched quote, so read.table treated everything after it as part of one quoted field and happily ran right past the end of the line and on to the next, and the next, and the next... until it hit the size limit for a single string and blew up.
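One way to track down the offending rows is to look for lines that contain an odd number of double quotes. The following is a minimal sketch, not something from the original post: find_unbalanced_quotes is a hypothetical helper name, the chunk size is arbitrary, and it assumes the double quote is the only quote character that matters. It reads the file in chunks so an 11 GB file never has to sit in memory at once.

```r
# Hypothetical helper: return the line numbers that contain an odd number
# of double-quote characters, reading the file chunk by chunk.
find_unbalanced_quotes <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  offset <- 0L          # number of lines consumed so far
  hits <- integer(0)    # line numbers with an odd quote count
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    # Count the literal double quotes on each line.
    n_quotes <- lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))
    hits <- c(hits, offset + which(n_quotes %% 2L == 1L))
    offset <- offset + length(lines)
  }
  hits
}

# Small made-up example: the third line has a single stray quote.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tname", "1\talpha", "2\tbe\"ta", "3\tgamma"), tmp)
suspects <- find_unbalanced_quotes(tmp, chunk_size = 2L)
suspects
```

Line numbers here count the header as line 1. A line with an odd quote count is only a suspect, since a legitimately quoted field can also span lines.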

The solution was to explicitly set quote = "" in the argument list, which disables quote handling entirely.
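A small illustration of the effect, using base read.delim on made-up inline data rather than read.delim.ffdf so that it runs without the ff package. The sample table and variable names are invented for the example; two stray quotes on different rows stand in for the unmatched quote in the real file.

```r
# Five data rows; rows 2 and 4 each contain one stray double quote.
tsv <- paste(
  "id\tname",
  "1\talpha",
  "2\tbe\"ta",
  "3\tgamma",
  "4\tdel\"ta",
  "5\tepsilon",
  sep = "\n"
)

# Default quoting (quote = "\"" for read.delim): the text between the two
# stray quotes, line breaks included, is read as part of a single field.
with_default <- read.delim(text = tsv, stringsAsFactors = FALSE)

# Quoting disabled: every quote is an ordinary character.
no_quote <- read.delim(text = tsv, quote = "", stringsAsFactors = FALSE)

nrow(with_default)
nrow(no_quote)
no_quote$name[2]
```

With quoting disabled, all five rows come back and the quote stays in the data; with the default, the rows between the stray quotes should collapse into one field. In the call from the question, the same argument would go next to header, na.strings and colClasses, assuming read.delim.ffdf passes it through to the underlying reader.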