How to find the byte position of a specific line in a file

Posted 2019-05-02 01:05

What's the fastest way to find the byte position of a specific line in a file, from the command line?

e.g.

$ linepos myfile.txt 13
5283

I'm writing a parser for a CSV that's several GB in size, and in the event the parser is halted, I'd like to be able to resume from the last position. The parser is in Python, but even iterating over file.readlines() takes a long time, since there are millions of rows in the file. I'd like to simply do file.seek(int(command.getoutput("linepos myfile.txt %i" % lastrow))), but I can't find a shell command to efficiently do this.

Edit: Sorry for the confusion, but I'm looking for a non-Python solution. I already know how to do this from Python.

3 answers
The star\"
#2 · 2019-05-02 01:23

From @chepner's comment on my other answer:

position = 0  # or wherever you left off last time
try:
    # binary mode: tell() returns a true byte offset and stays usable while iterating
    with open('myfile.txt', 'rb') as file:
        file.seek(position)  # zero in base case
        for line in file:
            # process the line
            position = file.tell()  # byte offset of the start of the next line
except:
    print('exception occurred at position {}'.format(position))
    raise
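
As a rough illustration of the resume idea above, here is a minimal sketch that processes a small temporary file, simulates a halt on the second row, and then restarts from the saved offset. The names `process_from` and `handler`, and the sample data, are hypothetical and only there for the demo; the saved position is advanced only after a row has been handled, so the failed row is retried on resume.

```python
import os
import tempfile


def process_from(path, position, handler):
    """Process lines starting at byte offset `position`.

    Returns (position, error): the offset of the first unprocessed line
    and the exception that stopped processing (None if the file was finished).
    """
    with open(path, 'rb') as f:  # binary mode, so offsets are real byte offsets
        f.seek(position)
        for line in f:
            try:
                handler(line)
            except Exception as exc:
                return position, exc  # position still points at the failed line
            position = f.tell()  # advance only after the line was handled
    return position, None


# Build a tiny sample file (hypothetical data).
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'a,1\nbb,2\nccc,3\n')
    path = tmp.name

state = {'fail_once': True}
processed = []


def handler(line):
    # Simulate a halt the first time the second row is seen.
    if line.startswith(b'bb') and state['fail_once']:
        state['fail_once'] = False
        raise RuntimeError('simulated halt')
    processed.append(line)


pos, err = process_from(path, 0, handler)       # stops at the 'bb,2' row
pos2, err2 = process_from(path, pos, handler)   # resumes from the saved offset
os.remove(path)
print(pos, pos2, processed)
```

In a real parser the offset would have to be written somewhere durable (a small checkpoint file, for instance) so that it survives the process being killed.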
我只想做你的唯一
#3 · 2019-05-02 01:23

Well, if your pattern is simple, this is easy:

$ echo -e '#!/bin/bash\necho abracadabra' >/tmp/script
$ pattern=bash
$ sed -rn "0,/$pattern/ {s/^(.*)$pattern.*$/\1/p ;t exit; p; :exit }" /tmp/script \
    | wc -c 
8

As you can see, this outputs the position of the first character of your pattern, counting the first character in the file as number 1.

NB 1: sed appends a trailing newline to the last line it prints. The part of the line preceding the pattern is 7 bytes (count them → #!/bin/), but wc -c reports 8, because the output actually looks like this:

$ sed -rn "0,/$pattern/ {s/^(.*)$pattern.*$/\1/p ;t exit; p; :exit }" /tmp/script \
   | hexdump -C
00000000  23 21 2f 62 69 6e 2f 0a                           |#!/bin/.|
00000008

This could be a source of trouble if you were looking for EOF, for example. I can’t think of a better case; I just want to point it out.

NB 2: If the pattern contains special characters (regex metacharacters or the / delimiter), sed will fail or match the wrong thing. If you could provide an example of what you are looking for, I could escape it.

NB 3: This assumes the pattern is unique. If you stopped reading the file at the second or third instance of the pattern, this will not work.


Update: I’ve found a simpler way.

$ grep -bo bash <<< '#!/bin/bash'
7:bash

GNU grep has two relevant options here, -b and -o. From the man page:

-b, --byte-offset
    Print the 0-based byte offset within the input file before each line of
    output. If -o (--only-matching) is specified, print the offset of the
    matching part itself.
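
To make the 0-based convention concrete, here is a small Python sketch that computes the same kind of offsets for a fixed byte string. It is only an illustration of the counting, not a reimplementation of grep; the helper name `byte_offsets` is made up for this example.

```python
def byte_offsets(data: bytes, pattern: bytes):
    """Return the 0-based byte offsets of all non-overlapping matches of `pattern`."""
    if not pattern:
        raise ValueError('pattern must not be empty')
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        # continue searching after the end of the current match
        start = data.find(pattern, start + len(pattern))
    return offsets


# Same content as the /tmp/script file created earlier in this answer.
sample = b'#!/bin/bash\necho abracadabra\n'
bash_offsets = byte_offsets(sample, b'bash')
abra_offsets = byte_offsets(sample, b'abra')
print(bash_offsets, abra_offsets)
```

The offset found for bash is 7, matching the grep -bo output above, and it is one less than the 1-based number that the sed | wc -c pipeline printed.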

I’d suggest using grep, because with the -F option it treats the pattern as a fixed string rather than a regular expression.

$ grep -F '!@##$@#%%^%&*%^&*(^)((**%%^@#' <<<'!@##$@#%%^%&*%^&*(^)((**%%^@#' 
!@##$@#%%^%&*%^&*(^)((**%%^@#
我命由我不由天
#4 · 2019-05-02 01:26

Iterating over a file object yields lines with their line endings intact, so you can add each line's length to a running counter to get the position. Note that in text mode len() counts characters, not bytes, and with a variable-width encoding such as UTF-8 you can't simply multiply by a fixed character size. Opening the file in binary mode makes len(line) a byte count regardless of the encoding.

position = 0  # or wherever you left off last time
try:
    with open('myfile.txt', 'rb') as file:  # don't you go correcting me on naming it file. we don't call file directly anyway!
        file.seek(position)  # zero in base case
        for line in file:
            # process the line
            position += len(line)  # a byte count, because the file is open in binary mode
except:
    # yes, a naked exception. TWO faux pas in one answer?!?
    print('exception occurred at position {}'.format(position))
    raise  # re-raise to see traceback or what have you
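
The gap between character count and byte count is easy to see with a single row containing one non-ASCII character. This is just a sketch with a made-up sample row; the encodings shown are examples, not a claim about what the asker's CSV uses.

```python
line = 'na\u00efve,1\n'  # hypothetical CSV row with one non-ASCII character

chars = len(line)                            # number of characters
utf8_bytes = len(line.encode('utf-8'))       # the non-ASCII character takes two bytes
utf16_bytes = len(line.encode('utf-16-le'))  # two bytes per character for this row

print(chars, utf8_bytes, utf16_bytes)
```

Only the byte count is safe to pass to seek() on a file opened in binary mode, which is why counting len() of bytes lines is the simpler route.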