Is there a way to get the number of lines in a file without importing it?
So far this is what I am doing:

myfiles <- list.files(pattern = "*.dat")
myfilesContent <- lapply(myfiles, read.delim, header = FALSE, quote = "\"")

test <- vector("list", length(myfiles))  # initialize the result list
for (i in seq_along(myfiles)) {
  test[[i]] <- length(myfilesContent[[i]]$V1)
}

but it is too time-consuming since each file is quite big.
If you are using Linux, this might work for you: let wc -l do the counting from the shell.
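A minimal sketch of that approach, assuming wc is on your PATH (the exact call may differ on your system):

# total lines in a file, counted by the shell; prints e.g. "12101000 yourfile.dat"
system("wc -l yourfile.dat")

In your case you can collect one count per file, e.g.:

# intern = TRUE captures wc's output ("<count> <file>"); take the first field
# shQuote() protects file names containing spaces
line_counts <- sapply(myfiles, function(f) {
  out <- system(paste("wc -l", shQuote(f)), intern = TRUE)
  as.integer(strsplit(trimws(out), "\\s+")[[1]][1])
})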
You can count the number of newline characters (\n; this will also work for \r\n on Windows) in a file. This will give you a correct answer iff the last line ends with a newline (BTW, read.csv gives a warning if this doesn't hold). It'll suffice to read the file in parts; below I set a chunk (tmp buf) size of 65536 bytes:
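A sketch of that chunked loop, assuming a placeholder file name (raw byte 10 is "\n"):

f <- file("myfile.dat", open = "rb")  # binary mode, so no encoding conversion
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
  nlines <- nlines + sum(chunk == as.raw(10L))  # count the newline bytes
}
close(f)
nlines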
Benchmarks on a ca. 512 MB ASCII text file, 12101000 text lines, Linux:

- readBin: ca. 2.4 s.
- @luis_js's wc-based solution: 0.1 s.
- read.delim: 39.6 s.
- EDIT: reading the file line by line with readLines (f <- file("/tmp/test.txt", open="r"); nlines <- 0L; while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l); close(f)): 32.0 s.

I found an easy way to do this using the R.utils package.
Here is how it works:
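A short sketch, assuming one of the .dat files from the question (R.utils::countLines counts lines without parsing the file into a data frame):

library(R.utils)

countLines("myfile.dat")  # returns the number of lines

And for all files at once:

sapply(myfiles, countLines)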
If you can't use wc (e.g. calling system2("wc"… will cause problems on your platform) and you are OK with using the inline package, then the following should be about as fast as you can get; it's pretty much the 'line count' portion of wc in an inline R C function:
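A minimal sketch of such a function, assuming the inline package (this is not the answer's original C code; it just counts '\n' bytes in fixed-size reads):

library(inline)

count_lines_c <- cfunction(
  signature(path = "character"),
  includes = "#include <stdio.h>",
  language = "C",
  body = "
    const char *fname = CHAR(STRING_ELT(path, 0));
    FILE *fp = fopen(fname, \"rb\");
    if (fp == NULL) error(\"cannot open %s\", fname);
    unsigned char buf[65536];
    size_t n, i;
    double nlines = 0;  /* returned as an R numeric */
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
      for (i = 0; i < n; i++)
        if (buf[i] == '\\n') nlines++;
    fclose(fp);
    return ScalarReal(nlines);
  "
)

count_lines_c("myfile.dat")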
It'd be better as a package, since it actually has to compile the first time through; but it's here for reference if you really do need "speed". For a 189,955-line file I had lying around, I get (mean values from a bunch of runs):

Maybe I am missing something, but usually I do it by taking length() on top of readLines():
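For example, something like this (myfile.dat is a placeholder):

con <- file("myfile.dat", open = "r")  # open a connection to the file
length(readLines(con))                 # number of lines
close(con)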
This has at least worked in the many cases I've had. I think it's fairly fast, and it only creates a connection to the file without importing it.