Read lines by number from a large file

Posted 2019-01-21 16:55

I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.

How can I read out those lines in one pass?

I was hoping for a C function that does it in one pass.

6 answers
三岁会撩人
#2 · 2019-01-21 17:31

The trick is to use a connection and open it before calling read.table:

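con <- file('filename')
open(con)  # an open connection keeps its read position between calls

read.table(con, skip=5, nrow=1)   # 6th line
read.table(con, skip=20, nrow=1)  # 27th line (skip counts from the current position)
...
close(con)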

You may also try scan; it is faster and gives more control.
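
The same idea can be wrapped up for a whole vector of line numbers. This is only a minimal sketch, not a tested production routine: the helper name read_lines_at and its chunk parameter are made up for illustration, and it uses readLines on the open connection instead of read.table. It discards the gaps between wanted lines in bounded chunks so memory use stays small.

```r
# Hypothetical helper: return the lines of `path` whose 1-based numbers
# are listed in `wanted`, reading the file once from top to bottom.
read_lines_at <- function(path, wanted, chunk = 10000L) {
  wanted <- sort(unique(as.integer(wanted)))
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- character(0)
  pos <- 0L                          # number of lines consumed so far
  for (target in wanted) {
    # Discard the gap before `target` in bounded chunks to cap memory use
    while (pos < target - 1L) {
      got <- length(readLines(con, n = min(chunk, target - 1L - pos)))
      if (got == 0L) return(out)     # file ended before reaching `target`
      pos <- pos + got
    }
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) return(out)
    out <- c(out, line)
    pos <- pos + 1L
  }
  out
}
```

Note that the result comes back in file order with duplicates removed, not in the order the line numbers were given.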

劳资没心,怎么记你
#3 · 2019-01-21 17:38

I compiled a solution based on the discussions here.

scan(filename, what=list(NULL), sep='\n', blank.lines.skip=FALSE)

This only reports the number of lines and reads nothing into memory. If you want blank lines to be skipped, set the last argument to TRUE.

来,给爷笑一个
#4 · 2019-01-21 17:40

If you just need to read the first lines of a text file in general:

cat(readLines("filename.txt", n=10), sep="\n")

Here n is the number of lines to read, and the sep argument of cat sets the separator printed after each line.

For details on the readLines function, see the official documentation: https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/readLines

孤傲高冷的网名
#5 · 2019-01-21 17:41

If your file has fixed line lengths then you can use 'seek' to jump to any byte position. So for each line N you want (counting from 1), jump to (N - 1) * line_length, where line_length includes the line terminator, and read one line.

However, from the R docs:

 Use of seek on Windows is discouraged.  We have found so many
 errors in the Windows implementation of file positioning that
 users are advised to use it only at their own risk, and asked not
 to waste the R developers' time with bug reports on Windows'
 deficiencies.

You can also do the same in C with fseek from the standard C library, but I don't know if the above warning also applies!
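
A minimal sketch of the fixed-length idea in R, under the assumption that every line occupies exactly the same number of bytes including its terminator. The helper name read_fixed_lines and the line_bytes parameter are invented for illustration, and the Windows warning quoted above still applies to seek.

```r
# Hypothetical helper: assumes EVERY line occupies exactly `line_bytes`
# bytes, terminator included. `wanted` holds 1-based line numbers that
# must all exist in the file.
read_fixed_lines <- function(path, wanted, line_bytes) {
  con <- file(path, open = "rb")   # binary mode: offsets are plain byte counts
  on.exit(close(con))
  vapply(wanted, function(n) {
    # Line n starts after the n - 1 full lines that precede it
    seek(con, where = (n - 1) * line_bytes, origin = "start")
    readLines(con, n = 1L)
  }, character(1))
}
```

Unlike a sequential pass, this returns the lines in the order requested, since each one is fetched by its own jump.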

我欲成王,谁敢阻挡
#6 · 2019-01-21 17:48

If it's a binary file

Some discussion is here: Reading in only part of a Stata .DTA file in R

If it's a CSV or other text file

If the rows you want are contiguous and at the top of the file, just use the nrows argument to read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to call read.csv repeatedly (reading in a new row or group of contiguous rows with each call) and then rbind the results together.
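
As a rough sketch of the skip/nrows/rbind approach, assuming a CSV file with a header line: the helper name read_csv_rows is made up for illustration, and it reads one row per call for simplicity. Each call rescans the file from the top, so this is simple but not a single pass.

```r
# Hypothetical helper: `rows` are 1-based data-row numbers (the header
# line is not counted). Assumes the file has a header line.
read_csv_rows <- function(path, rows) {
  # Read the column names once from the top of the file
  header <- names(read.csv(path, nrows = 1))
  pieces <- lapply(sort(rows), function(r) {
    # Skipping r lines drops the header plus the r - 1 rows before row r
    read.csv(path, skip = r, nrows = 1, header = FALSE, col.names = header)
  })
  do.call(rbind, pieces)
}
```

Because every row is parsed on its own, column types are inferred per row and only reconciled by rbind, so it may be worth passing colClasses if the types matter.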

Evening l夕情丶
#7 · 2019-01-21 17:55

Before I was able to get an R answer, I did it in Ruby:

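#!/usr/bin/env ruby

NUM_SEQS = 14024829  # number of lines in the file

# Pick 10 random 1-based line numbers (f.lineno starts at 1, so 0 would never match)
linenumbers = (1..10).collect { rand(1..NUM_SEQS) }

File.open("./data/uniprot_2011_02.tab") do |f|
  while line = f.gets
    # f.lineno is the number of the line just read
    print line if linenumbers.include? f.lineno
  end
end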

It runs fast (as fast as my storage can read the file).
