I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.
How can I read out the lines in one pass?
I was hoping for a C function that does it in one pass.
The trick is to use a connection AND to open it before calling `read.table`. You may also try `scan`; it is faster and gives more control.
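A minimal sketch of the connection trick (file name, separator, and chunk sizes are placeholders):

```r
# Open the connection once; each successive read.table() call resumes
# where the previous one stopped, so the file is walked in one pass.
con <- file("huge.csv", open = "r")
chunk1 <- read.table(con, sep = ",", nrows = 1000)   # rows 1-1000
chunk2 <- read.table(con, sep = ",", nrows = 1000)   # rows 1001-2000
close(con)

# scan() can be driven the same way on the open connection, e.g.:
# scan(con, what = "", sep = "\n", nlines = 1000, quiet = TRUE)
```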
I compiled a solution based on the discussions here. It will only show you the number of lines while reading nothing into memory; if you really want to skip the blank lines, just set the last argument to TRUE.
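A sketch consistent with that description; the `count_lines` helper and its argument names are illustrative assumptions:

```r
# Stream the file in bounded chunks: only the line count is kept,
# so memory use stays flat no matter how large the file is.
count_lines <- function(path, chunk = 100000L, skip_blank = FALSE) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    batch <- readLines(con, n = chunk)
    if (length(batch) == 0L) break               # end of file
    if (skip_blank) batch <- batch[nzchar(batch)]
    n <- n + length(batch)
  }
  n
}

# count_lines("huge.txt", skip_blank = TRUE)  # last argument skips blanks
```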
If it is only about reading a text file in general, functions such as `read.table` will do, where the parameter `sep` sets the splitting character. For the function `readLines`, the official documentation can be consulted: https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/readLines
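A minimal sketch of such general-purpose reading (file name and separator are placeholders):

```r
# read.table() parses a delimited text file; `sep` sets the splitting character
df <- read.table("data.txt", sep = "\t", header = FALSE,
                 stringsAsFactors = FALSE)

# readLines() returns raw lines without any splitting
first100 <- readLines("data.txt", n = 100)
```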
If your file has fixed line lengths then you can use `seek` to jump to any character position. So just jump to N * line_length for each N you want, and read one line.
However, from the R docs for `seek`: "Use of seek on Windows is discouraged. We have found so many errors in the Windows implementation of file positioning that users are advised to use it only with great caution."
You can also use `fseek` from the standard C library in C, but I don't know if the above warning also applies!
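A sketch of the fixed-length-record approach in R (the helper name is made up; `line_length` must count the newline character too):

```r
# Line N of a fixed-width file starts at byte (N - 1) * line_length,
# so we can seek straight to it and read a single line.
read_fixed_line <- function(path, n, line_length) {
  con <- file(path, open = "r")
  on.exit(close(con))
  seek(con, where = (n - 1) * line_length)   # jump to start of line n
  readLines(con, n = 1)
}

# e.g. 80-character lines plus a newline -> line_length = 81:
# sapply(c(7, 5000, 1200000), read_fixed_line, path = "fixed.txt", line_length = 81)
```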
If it's a binary file:

Some discussion is here: Reading in only part of a Stata .DTA file in R

If it's a CSV or other text file:

If they are contiguous and at the top of the file, just use the `nrows` argument to `read.csv` or any of the `read.table` family. If not, you can combine the `nrows` and `skip` arguments to repeatedly call `read.csv` (reading in a new row or group of contiguous rows with each call) and then `rbind` the results together.
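A sketch of that skip/nrows pattern (the row positions are hypothetical):

```r
# Pull two non-contiguous blocks of rows from a large CSV and
# stitch them together with rbind().
hdr    <- read.csv("big.csv", nrows = 1)                        # column names
block1 <- read.csv("big.csv", skip = 1000, nrows = 5,  header = FALSE)
block2 <- read.csv("big.csv", skip = 2000, nrows = 10, header = FALSE)
names(block1) <- names(block2) <- names(hdr)
result <- rbind(block1, block2)
```

Note that each call rescans the file from the top, so this is simple but not one-pass.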
Before I was able to get an R solution/answer, I did it in Ruby; it runs fast (as fast as my storage can read the file).
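A sketch of the same single-pass idea, rendered in R for consistency with the rest of the thread (the helper name and chunked skipping are illustrative, not the original Ruby script):

```r
# Walk the file exactly once with an open connection, skipping ahead
# in bounded chunks so memory stays small regardless of file size.
extract_lines <- function(path, line_numbers, chunk = 100000L) {
  wanted <- sort(unique(line_numbers))
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- character(0)
  pos <- 0
  for (target in wanted) {
    to_skip <- target - pos - 1
    while (to_skip > 0) {                        # discard unwanted lines
      skipped <- length(readLines(con, n = min(to_skip, chunk)))
      if (skipped == 0) return(out)              # hit end of file early
      to_skip <- to_skip - skipped
    }
    line <- readLines(con, n = 1)
    if (length(line) == 0) return(out)
    out <- c(out, line)
    pos <- target
  }
  out
}

# extract_lines("huge.txt", c(42, 1000000, 14999999))
```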