Fastest way to skip lines while parsing files in Ruby

Posted 2019-03-26 06:46

Question:

I tried searching for this, but couldn't find much. It seems like something that's probably been asked before (many times?), so I apologize if that's the case.

I was wondering what the fastest way to parse certain parts of a file in Ruby would be. For example, suppose I know the information I want for a particular function is between lines 500 and 600 of, say, a 1000-line file. (Obviously this kind of question is geared toward much larger files; I'm just using those smaller numbers for the sake of example.) Since I know it won't be in the first half, is there a quick way of disregarding that information?

Currently I'm using something along the lines of:

while (buffer = file_in.gets) and file_in.lineno < 600
  next unless file_in.lineno > 500
  # chomp rather than chomp!: chomp! returns nil when there is no newline to strip
  if buffer.chomp.include? some_string
    do_func_whatever
  end
end

It works, but I just can't help but think it could work better.

I'm very new to Ruby and am interested in learning new ways of doing things in it.

Answer 1:

file.each_line.drop(500).take(100) # will get you lines 501-600 (IO#lines was removed in Ruby 3.0; each_line replaces it)

Generally, you can't avoid reading the file from the start up to the line you are interested in, because each line can have a different length. The one thing you can avoid, though, is loading the whole file into a big array. Just read line by line, counting, and discard lines until you reach the ones you are looking for. That is pretty much what your own example does. You can just make it more Rubyish.
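
As a rough sketch of the "read line by line, counting, and discard" approach described above, something like the following could work. The helper name each_line_in_range is made up for illustration, a StringIO stands in for the real file, and line numbers are assumed to be 1-based and inclusive.

```ruby
require "stringio"

# Hypothetical helper (not part of Ruby's standard library): yields the lines
# numbered first_line..last_line (1-based, inclusive) and stops reading after
# last_line instead of scanning the rest of the input.
def each_line_in_range(io, first_line, last_line)
  return if last_line < first_line

  io.each_line.with_index(1) do |line, lineno|
    next if lineno < first_line    # discard lines before the range
    yield line
    break if lineno >= last_line   # nothing of interest after this point
  end
end

# Demo data standing in for a real file: "line 1\n" through "line 1000\n".
sample = StringIO.new((1..1000).map { |n| "line #{n}\n" }.join)

each_line_in_range(sample, 501, 600) do |line|
  puts line if line.chomp.include?("line 55")   # placeholder for the real check
end
```

The early break is the part that matters for large files: everything after the last wanted line is never read.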

PS. The Tin Man's comment made me do some experimenting. While I didn't find any reason why drop would load the whole file, there is indeed a problem: drop returns the rest of the file in an array. Here's a way this can be avoided:

file.each_line.select.with_index { |_line, i| (500..599).cover?(i) } # with_index is zero-based, so 500..599 are lines 501-600

PS2: Doh, the code above, while not building a huge array, still iterates through the whole file, including the lines after line 600. :( Here's a third version:

enum = file.each_line
500.times { enum.next } # skip the first 500 lines
enum.take(100) # take the next 100 (lines 501-600)

or, if you prefer FP:

file.each_line.tap { |enum| 500.times { enum.next } }.take(100)
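
One more variant, assuming Ruby 2.0 or later: a lazy enumerator sidesteps the problem of drop returning the rest of the file as an array. This is only a sketch; the helper name line_window is made up, and a StringIO stands in for the open file.

```ruby
require "stringio"

# Hypothetical helper: skip `skip` lines, then return the next `count` lines.
# `lazy` keeps drop from building an array of everything after the skipped
# lines, and `first` stops pulling lines once it has collected `count` of them.
def line_window(io, skip, count)
  io.each_line.lazy.drop(skip).first(count)
end

# Demo data standing in for a real file: "line 1\n" through "line 1000\n".
sample = StringIO.new((1..1000).map { |n| "line #{n}\n" }.join)

window = line_window(sample, 500, 100)
puts window.first
puts window.last
```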

Anyway, the upside of this monologue is that you get to see several ways to iterate over a file. ;)



Answer 2:

I don't know if there is an equivalent way of doing this for lines, but you can use seek or the offset argument on an IO object to "skip" bytes.

See IO#seek, or see IO.read for information on the offset argument.
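
A minimal sketch of the seek idea, under the assumption that the byte offset of the first wanted line is known or has been recorded on an earlier pass. Since seek works in bytes rather than lines, finding that offset still takes one scan unless every line has the same length. Both helper names below are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: byte offset at which the 1-based line `lineno` starts.
# This still reads the preceding lines once; the result can be stored and
# reused so that later reads skip that work.
def byte_offset_of_line(path, lineno)
  File.open(path) do |f|
    (lineno - 1).times { break if f.gets.nil? }   # stop early at end of file
    f.pos
  end
end

# Hypothetical helper: jump straight to `offset` and read `count` lines.
def lines_from_offset(path, offset, count)
  File.open(path) do |f|
    f.seek(offset, IO::SEEK_SET)
    f.each_line.first(count)
  end
end

# Demo on a temporary 1000-line file.
Tempfile.create("seek_demo") do |tmp|
  tmp.write((1..1000).map { |n| "line #{n}\n" }.join)
  tmp.flush
  offset = byte_offset_of_line(tmp.path, 501)
  puts lines_from_offset(tmp.path, offset, 100).first
end
```

This pays off mainly when the same region of a large file is read repeatedly, because the offset only has to be computed once.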



Answer 3:

Sounds like rio might be of help here. It provides you with a lines() method.



Answer 4:

You can use IO.readlines (or IO#readlines on an open file), which returns an array of all the lines. Note that this reads the whole file into memory:

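IO.readlines(file_in)[500..600].each do |line|
  # file_in is a path here; line is one line of the file, including its trailing \n
  # indices 500..600 are zero-based and inclusive, i.e. lines 501 through 601
  # stuff
end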

or

File.open(file_in) do |f| # the block form closes the file when done
  f.readlines[500..600].each do |line|
    # line is one line of the file, including its trailing \n
    # stuff
  end
end
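
Since array indices are zero-based and a two-dot range is inclusive, the mapping from line numbers to the slice is easy to get wrong by one. A small sketch of the arithmetic follows; the helper name slice_lines is made up, first_line is assumed to be at least 1, and a generated array stands in for the result of readlines.

```ruby
# Stand-in for the array that IO.readlines would return for a 1000-line file.
lines = (1..1000).map { |n| "line #{n}\n" }

# Hypothetical helper: map a 1-based, inclusive line range onto array indices.
# Returns an empty array when the range starts past the end of the file.
def slice_lines(lines, first_line, last_line)
  lines[(first_line - 1)..(last_line - 1)] || []
end

puts lines[500..600].size                # size of the slice taken by index 500..600
puts slice_lines(lines, 501, 600).size   # size of the slice for lines 501-600
puts slice_lines(lines, 501, 600).first
```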