R - Big Data - vector exceeds vector length limit

Published 2019-07-04 02:51

Question:

I have the following R code:

data <- read.csv('testfile.data', header = T)
mat = as.matrix(data)

Some more statistics of my testfile.data:

> ncol(data)
[1] 75713
> nrow(data)
[1] 44771

Since this is a large dataset, I am using an Amazon EC2 instance with 64 GB of RAM, so hopefully memory isn't an issue. I am able to load the data (the first line works), but the as.matrix transformation (the second line) throws the following error:

resulting vector exceeds vector length limit in 'AnswerType'

Any clue what might be the issue?

Answer 1:

As noted, the development version of R supports vectors with more than 2^31 - 1 elements. This is more-or-less transparent, for instance:

> m = matrix(0L, .Machine$integer.max / 4, 5)
> length(m)
[1] 2684354555

This is with

> R.version.string
[1] "R Under development (unstable) (2012-08-07 r60193)"

Large objects consume a lot of memory (62.5% of my 16 GB, in this example), and doing anything useful with them requires several times that. Further, even simple operations on large data can take appreciable time, and many operations on long vectors are not yet supported:

> sum(m)
Error: long vectors not supported yet:
    /home/mtmorgan/src/R-devel/src/include/Rinlinedfuns.h:100

So it often makes sense to process the data in smaller chunks, iterating through the file. This gives full access to R's routines and allows parallel evaluation (via the parallel package). Another strategy is to down-sample the data, which should not be too intimidating to a statistical audience. A sketch of the chunked approach follows below.
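For instance, here is a minimal sketch of chunked reading over an open connection; the 10,000-row chunk size and the processing step are placeholders to adapt to your data:

con <- file("testfile.data", open = "r")
chunk <- read.csv(con, nrows = 10000, header = TRUE)  # header + first chunk
col_names <- names(chunk)
repeat {
    ## ... compute on 'chunk' here (summaries, model fits, ...) ...
    if (nrow(chunk) < 10000)      # a short chunk means we hit end of file
        break
    chunk <- tryCatch(            # guard against an empty final read at EOF
        read.csv(con, nrows = 10000, header = FALSE,
                 col.names = col_names),
        error = function(e) data.frame())
    if (nrow(chunk) == 0)
        break
}
close(con)

Because each chunk fits comfortably in memory as an ordinary data.frame, every chunk can use R's normal routines, and independent chunks can be dispatched to workers with the parallel package.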



Answer 2:

Your matrix has more elements than the maximum vector length of 2^31-1. This is a problem because a matrix is just a vector with a dim attribute. read.csv works because it returns a data.frame, which is a list of vectors.

R> 75713*44771 > 2^31-1
[1] TRUE
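
To illustrate the "vector with a dim attribute" point (a small example, not from the original post):

> v <- 1:6
> dim(v) <- c(2, 3)   # adding a dim attribute turns the vector into a matrix
> is.matrix(v)
[1] TRUE
> length(v)           # still a single vector of 6 elements underneath
[1] 6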

See ?"Memory-limits" for more details.



Tags: r bigdata