I have a huge CSV file. Its size is around 9 GB and I have 16 GB of RAM. I followed the advice from the page and implemented it below.
If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field:
--max-vsize=500M
I am still getting the error and warnings below. How should I read this 9 GB file into R? I have 64-bit R 3.3.1 and I am running the command below in RStudio 0.99.903, on Windows Server 2012 R2 Standard (64-bit OS).
> memory.limit()
[1] 16383
> answer=read.csv("C:/Users/a-vs/results_20160291.csv")
Error: cannot allocate vector of size 500.0 Mb
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
2: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
3: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
4: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
5: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
6: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
7: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
8: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
9: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
10: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
11: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
12: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
------------------- Update1
My first try, based on a suggested answer:
> thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)
Read 44099243 rows and 36 (of 36) columns from 9.399 GB file in 00:13:34
Warning messages:
1: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv", :
Reached total allocation of 16383Mb: see help(memory.size)
2: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv", :
Reached total allocation of 16383Mb: see help(memory.size)
------------------- Update2
My second try, based on another suggested answer, is below:
thefile2 <- read.csv.ffdf(file="C:/Users/a-vs/results_20160291.csv", header=TRUE, VERBOSE=TRUE,
+ first.rows=-1, next.rows=50000, colClasses=NA)
read.table.ffdf 1..
Error: cannot allocate vector of size 125.0 Mb
In addition: There were 14 warnings (use warnings() to see them)
How could I read this file into a single object so that I can analyze the entire data in one go?
------------------- Update3
We bought an expensive machine. It has 10 cores and 256 GB of RAM. That is not the most efficient solution, but it will work at least for the near future. I looked at the answers below and I don't think they solve my problem :( I do appreciate these answers. I want to perform market basket analysis, and I don't think there is any way around keeping my data in RAM.
This would be horrible practice, but depending on how you need to process this data, it shouldn't be too bad. You can change the maximum amount of memory that R is allowed to use by calling

memory.limit(size)

where size is an integer giving R's new memory limit in MB. When you hit the hardware constraint, Windows will start paging memory onto the hard drive (not the worst thing in the world, but it will severely slow down your processing). If you are running this on a server version of Windows, paging will possibly (likely) work differently than on regular Windows 10; I believe it should be faster, as the Server OS should be optimized for this kind of thing.

Try starting off with something along the lines of 32 GB (or memory.limit(memory.limit()*2)), and if the data comes out MUCH larger than that, the program will end up being far too slow once it is loaded into memory. At that point I would recommend buying more RAM or finding a way to process the data in parts; a rough sketch of raising the limit follows.
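A minimal sketch of that approach (Windows-only; the 32000 MB value is just an illustration, and the path is the one from the question):

# Raise R's memory ceiling (in MB) so Windows pages to disk instead of R
# failing with an allocation error; expect the read to be slow.
memory.limit(size = 32000)

answer <- read.csv("C:/Users/a-vs/results_20160291.csv")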
You might want to consider leveraging some on-disk processing rather than holding the entire object in R's memory. One option is to store the data in a proper database and then have R access it. dplyr is able to deal with a remote source (it actually writes the SQL statements needed to query the database). I've just tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.

Install SQLite

https://www.sqlite.org/download.html
Enter the data into a new SQLite database

import.sql:

CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName

Yes, you'll need to specify the column names yourself (I believe) but you can specify their types here too if you wish. This won't work if you have commas anywhere in your names/data, of course.

sqlite3.exe BIGDATA.sqlite3 < import.sql
Point dplyr to the SQLite database

As we're using SQLite, all of the dependencies are handled by dplyr already.

library(dplyr)
my_db <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")
Do your exploratory analysis

dplyr will write the SQLite commands needed to query this data source. It will otherwise behave like a local table. The big exception is that you can't query the number of rows.

my_tbl %>%
  group_by(COL2) %>%
  summarise(meanVal = mean(COL3))
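If you eventually need an aggregated result back in R as an ordinary data frame, collect() will run the query inside SQLite and pull only the summarised rows into memory. A small sketch, reusing the hypothetical COL2/COL3 columns from above:

summary_df <- my_tbl %>%
  group_by(COL2) %>%
  summarise(meanVal = mean(COL3), n = n()) %>%
  collect()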
This may not be possible on your computer. In certain cases, a data.table takes up more space than its .csv counterpart, sometimes by an order of magnitude or two.

There is natural overhead to storing an object in R: roughly a 1.33x factor when reading a .csv file into memory, although this varies with the data. For example,

x = sample(1:10000000, 10000000, replace = T)

gives a factor of roughly 2x (R:csv), while

x = sample(c("foofoofoo","barbarbar"), 10000000, replace = T)

gives a factor of about 0.5x (R:csv). Going by the larger factor, your 9 GB file could take 18 GB of memory to store in R, if not more. Based on your error message, it is far more likely that you are hitting hard memory constraints than an allocation bug. Therefore, just reading your file in chunks and consolidating would not work - you would also need to partition your analysis and workflow. Another alternative is to use an on-disk tool such as a SQL database, as described above.
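If you want to estimate the R:csv factor for your own data before committing to an approach, you can write a small sample to disk and compare sizes. This is only a sketch; the file name and the single numeric column are made up:

library(data.table)

x <- data.frame(val = sample(1:10000000, 1000000, replace = TRUE))
write.csv(x, "size_test.csv", row.names = FALSE)
dt <- fread("size_test.csv")

# Ratio of in-memory size to on-disk size: the R:csv factor discussed above
as.numeric(object.size(dt)) / file.size("size_test.csv")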
You could try splitting your processing over the table. Instead of operating on the whole thing, put the whole operation inside a for loop and do it 16, 32, 64, or however many times you need to. Any values you need for later computation can be saved. This isn't as fast as the approaches in other answers, but it will definitely return; a sketch is below. Hope that helps.
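A rough sketch of that loop, re-reading one slice of the file per pass with data.table's fread. The chunk count is arbitrary, the row total is the one fread reported in Update1, and summarising counts by the file's first column is purely illustrative of "saving only the values you need":

library(data.table)

path       <- "C:/Users/a-vs/results_20160291.csv"
n_chunks   <- 32                               # or 16, 64, ...
rows_total <- 44099243                         # reported by fread in Update1
chunk_size <- ceiling(rows_total / n_chunks)
col_names  <- names(fread(path, nrows = 1))    # grab the header cheaply

partial <- vector("list", n_chunks)
for (i in seq_len(n_chunks)) {
  chunk <- fread(path,
                 skip      = 1 + (i - 1) * chunk_size,  # 1 extra for the header row
                 nrows     = chunk_size,
                 header    = FALSE,
                 col.names = col_names)
  # keep only a small per-chunk summary, e.g. row counts by the first column
  partial[[i]] <- chunk[, .N, by = c(col_names[1])]
  rm(chunk); gc()
}
result <- rbindlist(partial)[, .(N = sum(N)), by = c(col_names[1])]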
Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
In addition, you can read in the file in chunks:
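One base-R way to do that is to stream chunks from an open connection, so the file is only scanned once. This is a sketch: it assumes the header row contains no quoted commas, and the 500,000-row chunk size is arbitrary:

con <- file("C:/Users/a-vs/results_20160291.csv", open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]   # consume the header line

repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 500000, col.names = col_names),
    error = function(e) NULL     # read.csv errors once the connection is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # ... process `chunk` here and keep only the small results you need ...
}
close(con)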
You can use data.table to handle reading and manipulating large files more efficiently:
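A hedged sketch of that, assuming the market basket analysis only needs a couple of the 36 columns; "transaction_id" and "item" are made-up names, so substitute the columns you actually use:

library(data.table)

# select= makes fread parse only the named columns, which keeps the
# in-memory table far smaller than the full 36-column file.
thefile <- fread("C:/Users/a-vs/results_20160291.csv",
                 select = c("transaction_id", "item"))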
If needed, you can leverage storage memory with ff:
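A sketch with ff, which keeps the data in flat files on disk and reads the CSV chunk by chunk (the chunk sizes here are arbitrary). The Update2 attempt used first.rows = -1; if that value is passed straight through as read.table's nrows, it would mean "read everything in the first chunk", so a positive value is used here instead:

library(ff)

thefile2 <- read.csv.ffdf(file = "C:/Users/a-vs/results_20160291.csv",
                          header = TRUE,
                          VERBOSE = TRUE,
                          first.rows = 100000,   # small first chunk to sniff column classes
                          next.rows  = 500000,
                          colClasses = NA)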