Is there a way to get the number of lines in a file without importing it?
So far this is what I am doing:

myfiles <- list.files(pattern = "*.dat")
myfilesContent <- lapply(myfiles, read.delim, header = FALSE, quote = "\"")

test <- vector("list", length(myfiles))  # initialize the result list
for (i in seq_along(myfiles)) {
  test[[i]] <- length(myfilesContent[[i]]$V1)
}

but it is too time-consuming since each file is quite big.
If you are using Linux, this might work for you: let wc -l do the counting from the shell.
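A minimal sketch of that approach, assuming wc is on your PATH (the exact call may differ on your system):

# total lines in a file, counted by the shell; prints e.g. "12101000 yourfile.dat"
system("wc -l yourfile.dat")

In your case you can collect one count per file, e.g.:

# intern = TRUE captures wc's output ("<count> <file>"); take the first field
# shQuote() protects file names containing spaces
line_counts <- sapply(myfiles, function(f) {
  out <- system(paste("wc -l", shQuote(f)), intern = TRUE)
  as.integer(strsplit(trimws(out), "\\s+")[[1]][1])
})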
You can count the number of newline characters (\n; this will also work for \r\n on Windows) in a file. This will give you a correct answer iff the last line ends with a newline (BTW, read.csv gives a warning if this doesn't hold). It'll suffice to read the file in parts; below I set a chunk (tmp buf) size of 65536 bytes:
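A sketch of that chunked loop, assuming a placeholder file name (raw byte 10 is "\n"):

f <- file("myfile.dat", open = "rb")  # binary mode, so no encoding conversion
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
  nlines <- nlines + sum(chunk == as.raw(10L))  # count the newline bytes
}
close(f)
nlines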
Benchmarks on a ca. 512 MB ASCII text file, 12101000 text lines, Linux:

- readBin: ca. 2.4 s.
- @luis_js's wc-based solution: 0.1 s.
- read.delim: 39.6 s.
- EDIT: reading the file line by line with readLines (f <- file("/tmp/test.txt", open="r"); nlines <- 0L; while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l); close(f)): 32.0 s.

I found an easy way to do this using the R.utils package.
Here is how it works:
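A short sketch, assuming one of the .dat files from the question (R.utils::countLines counts lines without parsing the file into a data frame):

library(R.utils)

countLines("myfile.dat")  # returns the number of lines

And for all files at once:

sapply(myfiles, countLines)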
If you can't use wc (e.g. calling system2("wc"… will cause problems on your platform) and you are OK with using the inline package, then the following should be about as fast as you can get; it's pretty much the 'line count' portion of wc in an inline R C function:
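A minimal sketch of such a function, assuming the inline package (this is not the answer's original C code; it just counts '\n' bytes in fixed-size reads):

library(inline)

count_lines_c <- cfunction(
  signature(path = "character"),
  includes = "#include <stdio.h>",
  language = "C",
  body = "
    const char *fname = CHAR(STRING_ELT(path, 0));
    FILE *fp = fopen(fname, \"rb\");
    if (fp == NULL) error(\"cannot open %s\", fname);
    unsigned char buf[65536];
    size_t n, i;
    double nlines = 0;  /* returned as an R numeric */
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
      for (i = 0; i < n; i++)
        if (buf[i] == '\\n') nlines++;
    fclose(fp);
    return ScalarReal(nlines);
  "
)

count_lines_c("myfile.dat")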
It'd be better as a package, since it actually has to compile the first time through; but it's here for reference if you really do need "speed". For a 189,955-line file I had lying around, I get (mean values from a bunch of runs):

Maybe I am missing something, but usually I do it by taking length() on top of readLines():
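For example, something like this (myfile.dat is a placeholder):

con <- file("myfile.dat", open = "r")  # open a connection to the file
length(readLines(con))                 # number of lines
close(con)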
This has at least worked in the many cases I've had. I think it's fairly fast, and it only creates a connection to the file without importing it.