Reading a huge CSV in R

Posted 2020-02-17 09:43

I have a huge CSV file, around 9 GB in size, and 16 GB of RAM. I followed the advice from the page and implemented it below.

If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field: 
--max-vsize=500M 

I am still getting the error and warnings below. How can I read this 9 GB file into R? I am running 64-bit R 3.3.1 in RStudio 0.99.903 on Windows Server 2012 R2 Standard (64-bit OS), with the command shown below.

> memory.limit()
[1] 16383
> answer=read.csv("C:/Users/a-vs/results_20160291.csv")
Error: cannot allocate vector of size 500.0 Mb
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
2: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
[warnings 3 through 12 repeat the same message]

------------------- Update 1

My first attempt, based on a suggested answer:

> thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)
Read 44099243 rows and 36 (of 36) columns from 9.399 GB file in 00:13:34
Warning messages:
1: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv",  :
  Reached total allocation of 16383Mb: see help(memory.size)
2: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv",  :
  Reached total allocation of 16383Mb: see help(memory.size)

------------------- Update 2

My second attempt, based on another suggested answer, is below:

thefile2 <- read.csv.ffdf(file="C:/Users/a-vs/results_20160291.csv", header=TRUE, VERBOSE=TRUE, 
+                    first.rows=-1, next.rows=50000, colClasses=NA)
read.table.ffdf 1..
Error: cannot allocate vector of size 125.0 Mb
In addition: There were 14 warnings (use warnings() to see them)

How can I read this file into a single object so that I can analyze the entire dataset in one go?

------------------- Update 3

We bought an expensive machine with 10 cores and 256 GB of RAM. It is not the most efficient solution, but it will work at least for the near future. I looked at the answers below and I don't think they solve my problem :( I appreciate these answers. I want to perform market basket analysis, and I don't think there is any way around keeping my data in RAM.

Tags: r windows csv ram
5 Answers
手持菜刀,她持情操
#2 · 2020-02-17 10:04

This would be horrible practice, but depending on how you need to process this data, it shouldn't be too bad. You can change the maximum amount of memory R is allowed to use by calling memory.limit(new), where new is an integer giving R's new memory limit in MB. When you hit the hardware constraint, Windows will start paging memory onto the hard drive (not the worst thing in the world, but it will severely slow down your processing).

If you are running this on a server version of Windows, paging will likely work differently than on regular Windows 10. I believe it should be faster, as the server OS should be optimized for this kind of workload.

Try starting off with something along the lines of 32 GB (or memory.limit(memory.limit()*2)). If the data turns out to be MUCH larger than that, the program will probably end up being too slow once everything is loaded into memory. At that point I would recommend buying more RAM or finding a way to process the data in parts.
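A minimal sketch of that, using the file path from the question (memory.limit() is Windows-only and takes a size in MB):

memory.limit()                  # currently reports 16383 on this machine
memory.limit(size = 32768)      # raise the cap to roughly 32 GB of virtual memory

# The read may now finish, but expect heavy paging once physical RAM is exhausted
answer <- read.csv("C:/Users/a-vs/results_20160291.csv")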

家丑人穷心不美
#3 · 2020-02-17 10:09

You might want to consider leveraging some on-disk processing rather than holding the entire object in R's memory. One option would be to store the data in a proper database and then have R access it. dplyr can work with a remote source (it actually writes the SQL statements to query the database). I've only tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.

Install SQLite

https://www.sqlite.org/download.html

Enter the data into a new SQLite database

  • Save the following in a new file named import.sql

CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName

Yes, you'll need to specify the column names yourself (I believe), but you can also specify their types here if you wish. This won't work if you have commas anywhere in your names/data, of course.

  • Import the data into the SQLite database via the command line

sqlite3.exe BIGDATA.sqlite3 < import.sql

Point dplyr to the SQLite database

As we're using SQLite, all of the dependencies are handled by dplyr already.

library(dplyr)
my_db  <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")

Do your exploratory analysis

dplyr will write the SQLite commands needed to query this data source. It will otherwise behave like a local table. The big exception will be that you can't query the number of rows.

my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))

#>  Source:   query [?? x 2]
#>  Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>  
#>         COL2    meanVal
#>        <chr>      <dbl>
#>  1      1979   15.26476
#>  2      1980   16.09677
#>  3      1981   15.83936
#>  4      1982   14.47380
#>  5      1983   15.36479
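
Note that recent dplyr versions deprecate src_sqlite() in favour of a plain DBI connection (handled by the dbplyr backend). A roughly equivalent sketch, reusing the same hypothetical database path and table name:

library(DBI)
library(dplyr)   # dbplyr translates the dplyr verbs into SQL

con    <- DBI::dbConnect(RSQLite::SQLite(), "/PATH/TO/YOUR/DB/BIGDATA.sqlite3")
my_tbl <- tbl(con, "tableName")

# The aggregation runs inside SQLite; collect() pulls only the small result into R
my_tbl %>%
  group_by(COL2) %>%
  summarise(meanVal = mean(COL3)) %>%
  collect()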
家丑人穷心不美
#4 · 2020-02-17 10:13

This may not be possible on your computer. In certain cases, data.table takes up more space than its .csv counterpart.

library(data.table)

DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)  # 29 MB file
DT <- fread("test.csv")
object.size(DT)
> 40001072 bytes #40 MB

Two orders of magnitude larger:

DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)  # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
> 4000001072 bytes #4.00 GB

There is natural overhead to storing an object in R. Based on these numbers, reading a file costs roughly a 1.33x factor (R:csv). However, this varies with the data. For example, using

  • x = sample(1:10000000,10000000,replace = T) gives a factor of roughly 2x (R:csv).

  • x = sample(c("foofoofoo","barbarbar"),10000000,replace = T) gives a factor of 0.5x (R:csv).

Based on the maximum, your 9 GB file could take 18 GB or more of memory to store in R. Based on your error message, it is far more likely that you are hitting hard memory constraints than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work; you would also need to partition your analysis and workflow. Another alternative is to avoid holding the data in memory at all and use an on-disk tool such as SQL.
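
A rough way to estimate the full in-memory size before committing to the read is to load a sample and scale it up (44099243 is the row count fread reported in the question, and the path is the asker's):

library(data.table)

# Read only the first 100,000 rows, then extrapolate to the full row count
sample_dt <- fread("C:/Users/a-vs/results_20160291.csv", nrows = 100000)

rows_total <- 44099243
bytes_est  <- as.numeric(object.size(sample_dt)) / 1e5 * rows_total
bytes_est / 1024^3          # estimated GB needed to hold the whole table in RAM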

Bombasti
#5 · 2020-02-17 10:17

You could try splitting your processing over the table. Instead of operating on the whole thing, put the operation inside a for loop and process the file in 16, 32, 64, or however many chunks you need. Any values you need for later computation can be saved as you go. This isn't as fast as the approaches in the other answers, but it will definitely finish.

con <- file("in.csv", "r")                    # open a connection so each read resumes where the last one stopped
x   <- number_of_rows_in_file / CHUNK_SIZE    # number of chunks to read
for (i in seq(from = 1, to = x, by = 1)) {
    chunk <- read.csv(con, nrows = CHUNK_SIZE, header = (i == 1))
    # process `chunk` here and keep only the values you need later
}
close(con)

Hope that helps.

太酷不给撩
#6 · 2020-02-17 10:20

Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
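
A quick way to check which build is running, from the R console:

.Machine$sizeof.pointer   # 8 bytes per pointer means a 64-bit build of R
#> [1] 8
R.version$arch
#> [1] "x86_64"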

In addition, you can read in the file in chunks:

file_in    <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x          <- readLines(file_in, n=chunk_size)
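
Each subsequent readLines call on the same connection continues where the previous one stopped, so the rest of the file can be walked through with a loop along these lines (the processing step is left as a placeholder):

repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break    # end of file reached
  # parse/process the lines in `x` here, keeping only what you need
}
close(file_in)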

You can use data.table to handle reading and manipulating large files more efficiently:

require(data.table)
dt <- fread("in.csv", header = TRUE)

If needed, you can leverage storage memory with ff:

library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, 
                   first.rows=10000, next.rows=50000, colClasses=NA)