Counting rows with fread without reading the whole file

Posted 2019-02-20 23:51

Question:

This question already has an answer here:

  • Is it possible to get the number of rows in a CSV file without opening it? 4 answers

I want to use data.table to process a very big file. It doesn't fit in memory. I've thought of reading the file in chunks in a loop (increasing the skip parameter appropriately):

fread("myfile.csv", skip=loopindex, nrows=chunksize) 

processing each of these chunks and appending the resulting output with fwrite.
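The loop I have in mind would look roughly like this (a minimal sketch: `process_chunk` is a hypothetical placeholder for the real per-chunk work, and `n_total`, the number of data rows, is exactly the value I still need):

```r
library(data.table)

# Sketch: process a large CSV in chunks of `chunksize` rows, assuming
# the total number of data rows `n_total` is already known.
process_in_chunks <- function(infile, outfile, n_total, chunksize) {
    header <- names(fread(infile, nrows = 0))  # read column names only
    skipped <- 1                               # lines to skip; 1 = the header
    while (skipped - 1 < n_total) {
        chunk <- fread(infile, skip = skipped, nrows = chunksize,
                       header = FALSE, col.names = header)
        # `process_chunk` is a hypothetical placeholder for the real work
        result <- chunk  # e.g. result <- process_chunk(chunk)
        # first chunk writes the header, later chunks append without it
        fwrite(result, outfile, append = skipped > 1)
        skipped <- skipped + nrow(chunk)
    }
}
```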

In order to do it properly I need to know the total number of rows, without reading the whole file.

What's the proper/fastest way to do it?

I can only think of reading just the first column, but maybe there is a special command or trick, or maybe there is an automatic way to detect the end of the file.

Answer 1:

1) count.fields Not sure if count.fields reads the whole file into R at once. Try it to see if it works.

length(count.fields("myfile.csv", sep = ","))

If the file has a header subtract one from the above.
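For example (a quick self-contained check on a throwaway file, assuming a single header line):

```r
# Write a small demo file: one header line plus 3 data rows
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,x", "2,y", "3,z"), tmp)

# count.fields returns one entry per line; subtract 1 for the header
n_rows <- length(count.fields(tmp, sep = ",")) - 1
n_rows  # 3
```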

2) sqldf Another possibility is:

library(sqldf)
read.csv.sql("myfile.csv", sep = ",", sql = "select count(*) from file")

You may need other arguments as well depending on header, etc. Note that this does not read the file into R at all -- only into sqlite.

3) wc Use the system command wc, which should be available on all platforms that R runs on.

shell("wc -l myfile.csv", intern = TRUE)

or to directly get the number of lines in the file

read.table(pipe("wc -l myfile.csv"))[[1]]

or

read.table(text = shell("wc -l myfile.csv", intern = TRUE))[[1]]

Again, if there is a header subtract one.
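If you want this wrapped up, here is a small sketch using system2 instead of shell, so it also works outside Windows; it assumes wc is on the PATH and, by default, a one-line header:

```r
# Count data rows via `wc -l` (a sketch; requires wc on the PATH)
count_csv_rows <- function(path, header = TRUE) {
    # wc -l prints "<line count> <path>"; keep only the first token
    out <- system2("wc", c("-l", shQuote(path)), stdout = TRUE)
    n <- as.integer(strsplit(trimws(out), "\\s+")[[1]][1])
    if (header) n - 1L else n
}
```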

If you are on Windows be sure that Rtools is installed and use this:

read.table(pipe("C:\\Rtools\\bin\\wc -l myfile.csv"))[[1]]

Alternately on Windows without Rtools try this:

read.table(pipe('find /v /c "" myfile.csv'))[[3]]

See How to count no of lines in text file and store the value into a variable using batch script?



Answer 2:

The answer by @G. Grothendieck about using wc -l is a good one, if you can rely on it being present.

You might also want to look into iterating through the file in chunks, e.g. by employing something like this answer that only relies on base R functions.

Since you don't need to read single lines, you can read in a batch from a connection. For instance:

count_lines = function(filepath, batch) {
    con = file(filepath, "r")
    on.exit(close(con))  # close the connection even if an error occurs
    n = 0
    while (TRUE) {
        lines = readLines(con, n = batch)
        present = length(lines)
        n = n + present
        if (present < batch) {  # a short read means we reached end of file
            break
        }
    }
    return(n)
}

Then you could read the file in, say, 1,000 lines at a time:

count_lines("filename.txt", 1000)