What's the fastest way to find the byte position of a specific line in a file, from the command line?
e.g.
$ linepos myfile.txt 13
5283
I'm writing a parser for a CSV that's several GB in size, and in the event the parser is halted, I'd like to be able to resume from the last position. The parser is in Python, but even iterating over file.readlines() takes a long time, since there are millions of rows in the file. I'd like to simply do file.seek(int(command.getoutput("linepos myfile.txt %i" % lastrow))), but I can't find a shell command to efficiently do this.
Edit: Sorry for the confusion, but I'm looking for a non-Python solution. I already know how to do this from Python.
From @chepner's comment on my other answer:
Well, if your pattern is simple, this would be simple
As you can see, this outputs the position of the first character of your pattern, assuming the first character in the file is numbered 1.
NB 1: sed has a habit of adding a trailing newline to the last string it parses. So when we take the part of the line preceding the pattern, the number of bytes in the output should be 7 (count them → #!/bin/), but wc -c also counts that trailing newline. This could be a potential source of trouble if you were looking for EOF, for example. I can't think of a more fitting case; I just want to point it out.
NB 2: If the pattern contains special characters, sed will fail. If you could provide an example of what you are looking for, I could escape it.
NB 3: This assumes the pattern is unique. If you stop reading the file at a second or third instance of the pattern, this will not work.
Update: I've found a simpler way.
For GNU grep there are two options:
I'd suggest using grep, because if you specify the -F option, it treats the pattern as a plain string.
Iterating over a file object yields lines with their line endings intact, so you should be able to just add each line's len to a running counter to get the position. If the file is opened in text mode, len counts characters rather than bytes, so you'd need to account for the character encoding (bytes per character); opening the file in binary mode avoids that.
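A minimal sketch of that counting approach, assuming the file is opened in binary mode so that len already gives a byte count. The function name line_offset and the 1-based line numbering are illustrative choices, not something from the posts above.

```python
import itertools


def line_offset(path, line_number):
    """Return the byte offset at which line `line_number` (1-based) starts.

    Hypothetical helper: the name and the 1-based numbering are assumptions.
    If the file has fewer lines than requested, the file size is returned.
    """
    if line_number < 1:
        raise ValueError("line_number must be >= 1")
    offset = 0
    # Binary mode: each line is a bytes object with its line ending intact,
    # so len(line) is a byte count regardless of the text encoding.
    with open(path, "rb") as f:
        # Sum the lengths of every line that precedes the requested one.
        for line in itertools.islice(f, line_number - 1):
            offset += len(line)
    return offset
```

The result can be passed to seek on a file opened in binary mode, e.g. f.seek(line_offset("myfile.txt", 13)) positions the file at the start of line 13. This still reads every preceding line once, so for repeated resumes it is likely cheaper to save the offset as the parser goes.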