I have a 5 GB CSV file with 2 million rows. The header is a row of comma-separated strings, and each data row is comma-separated doubles with no missing or corrupted values. The file is rectangular.
My objective is to read a random 10% of the rows (with or without replacement, it doesn't matter) into RAM as fast as possible. An example of a slow solution (but still faster than read.csv) is to read in the whole matrix with fread and then keep a random 10% of the rows:
require(data.table)
X <- data.matrix(fread('/home/user/test.csv')) # reads the full data.matrix
X <- X[sample(1:nrow(X))[1:round(nrow(X)/10)],] # keep a random 10% of the rows
However, I'm looking for the fastest possible solution (this one is slow because I need to read the whole file first and then trim it afterwards). The solution deserving of a bounty will give system.time() estimates of the different alternatives.
Other:
- I am using Linux
- I don't need exactly 10% of the rows. Just approximately 10%.
Here's a test file with 100000 comma-separated data lines plus a header. We want to keep the header and sample each data line only if a random number drawn uniformly between 0 and 1 is greater than 0.9.
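The one-liner from the original answer isn't reproduced above; as a sketch of that rule, it can be done with awk called from R via system() (the tool choice and the input name test.csv are assumptions; sample.csv is the output file read back in below):

# Keep the header (NR == 1) and each data line with probability ~0.1 (rand() > 0.9).
# test.csv is a placeholder name for the 100000-line test file.
system("awk 'BEGIN { srand() } NR == 1 || rand() > 0.9' test.csv > sample.csv")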
As a check, sample.csv comes out with 10027 lines: the header plus roughly 10% of the 100000 data lines.
This took 0.033s of real time on my four-year-old box; disk speed is probably the limiting factor here. It should scale linearly, since the file is handled strictly line by line.
You then read sample.csv into R using read.csv or fread, as desired:
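For example (a minimal sketch; sample.csv is the file produced by the filtering step above):

library(data.table)
X <- fread("sample.csv")  # or X <- read.csv("sample.csv")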
I think this should work pretty quickly, but let me know, since I have not tried it with big data yet.
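The snippet this refers to isn't shown above; as a sketch of the kind of call meant, assuming the iris data have been written out to iris.csv and that shuf is available on the system:

write.csv(iris, "iris.csv", row.names = FALSE)  # assumed setup: write iris to a csv
library(data.table)
fread("shuf -n 5 iris.csv")  # shuf returns 5 random lines; the header is shuffled in with the data, hence the modification below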
This takes a random sample of N = 5 from the iris dataset. To avoid the chance of picking up the header row as a data line, this might be a useful modification:
fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)
You could use sqldf::read.csv.sql and an SQL command to pull the data in. It doesn't calculate the 10% for you, but you can choose an absolute limit on the number of rows to return:
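The original call isn't shown above; a sketch of what it could look like, using SQLite's random() ordering, the file path from the question, and an assumed limit of 200000 rows (roughly 10% of 2 million):

library(sqldf)
# read.csv.sql exposes the csv as a table named "file" inside the SQL statement
X <- read.csv.sql("/home/user/test.csv",
                  sql = "select * from file order by random() limit 200000")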