How to find the byte position of a specific line in a file

What's the fastest way to find the byte position of a specific line in a file, from the command line?

e.g.

$ linepos myfile.txt 13
5283

I'm writing a parser for a CSV that's several GB in size, and if the parser is halted, I'd like to be able to resume from the last position. The parser is in Python, but even iterating over file.readlines() takes a long time, since there are millions of rows in the file. I'd like to simply do file.seek(int(commands.getoutput("linepos myfile.txt %i" % lastrow))), but I can't find a shell command that does this efficiently.

Edit: Sorry for the confusion, but I'm looking for a non-Python solution. I already know how to do this from Python.

Tags: linux bash command-line

3 answers

The star

#2 · 2019-05-02 01:23

From @chepner's comment on my other answer:

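position = 0  # or wherever you left off last time
try:
    # binary mode, so tell() returns exact byte offsets
    with open('myfile.txt', 'rb') as file:
        file.seek(position)  # zero in base case
        # readline() instead of "for line in file": in text mode the iterator's
        # read-ahead makes tell() unreliable (Python 2) or disabled (Python 3)
        for line in iter(file.readline, b''):
            position = file.tell()  # current seek position in file
            # process the line
except:
    print('exception occurred at position {}'.format(position))
    raise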


我只想做你的唯一

#3 · 2019-05-02 01:23

Well, if your pattern is simple, this is simple too:

$ echo -e '#!/bin/bash\necho abracadabra' >/tmp/script
$ pattern=bash
$ sed -rn "0,/$pattern/ {s/^(.*)$pattern.*$/\1/p ;t exit; p; :exit }" /tmp/script \
    | wc -c 
8

As you can see, this outputs the position of the first character of your pattern, assuming the first character in the file is numbered 1.

NB 1: sed appends a trailing newline to each line it prints. The part of the line preceding the pattern is 7 bytes long (count them → #!/bin/), but wc -c counts that newline as well, so what it actually sees looks like this:

$ sed -rn "0,/$pattern/ {s/^(.*)$pattern.*$/\1/p ;t exit; p; :exit }" /tmp/script \
   | hexdump -C
00000000  23 21 2f 62 69 6e 2f 0a                           |#!/bin/.|
00000008

This could be a source of trouble if you were looking for EOF, for example. I can't think of a more fitting case; I just want to point it out.

NB 2: If the pattern contains special characters, sed will fail. If you could provide an example of what you are looking for, I could escape it.

NB 3: This assumes the pattern is unique. If you need to stop at the second or third occurrence of the pattern, this will not work.

Update. I've found a simpler way.

$ grep -bo bash <<< '#!/bin/bash'
7:bash

GNU grep has two relevant options, -b and -o; the manual describes -b as follows:

-b, --byte-offset
    Print the 0-based byte offset within the input file before  each  line  of
    output. If -o (--only-matching)  is specified, print the offset of the
    matching part itself.

I'd suggest using grep, because with the -F option it treats the pattern as a fixed string rather than a regular expression.

$ grep -F '!@##$@#%%^%&*%^&*(^)((**%%^@#' <<<'!@##$@#%%^%&*%^&*(^)((**%%^@#' 
!@##$@#%%^%&*%^&*(^)((**%%^@#
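Since -b without -o prints the offset of each output line, the same option can be pointed at the original question (the offset of line N rather than of a pattern). What follows is only a sketch under a few assumptions: GNU grep and sed are available, lines end in "\n", and the function name linepos is simply borrowed from the question's example, not an existing command.

```shell
#!/bin/bash
# Hypothetical helper (name taken from the question's example, not a real tool).
# Prints the 0-based byte offset at which line N of FILE starts.
linepos() {
    local file=$1 n=$2
    # -b prefixes every output line with its byte offset, the empty pattern
    # matches every line, and -a treats the input as text even if it looks binary.
    # sed prints only line N and quits, so the rest of the file is not scanned.
    grep -ab '' "$file" | sed -n "${n}{p;q;}" | cut -d: -f1
}

# Small demonstration on a throwaway file whose lines are 2, 3 and 1 bytes long.
tmp=$(mktemp)
printf 'ab\ncde\nf\n' > "$tmp"
linepos "$tmp" 1   # line 1 starts at offset 0
linepos "$tmp" 3   # line 3 starts after "ab\n" (3 bytes) + "cde\n" (4 bytes)
rm -f "$tmp"
```

The result is 0-based, so it can be handed to a seek call as is; add 1 if you prefer the 1-based numbering used in the sed example.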


我命由我不由天

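#4 · 2019-05-02 01:26

Iterating over a file object yields lines with their line endings intact, so you can add each line's length to a running counter to get the position. If the lines are decoded text rather than bytes, len() counts characters, so the byte count depends on the character encoding: multiply by the character size for a fixed-width encoding, or encode each line and measure the result for a variable-width one such as UTF-8.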

position = 0  # or wherever you left off last time
try:
    # binary mode, so len(line) is a byte count and newlines are not translated
    with open('myfile.txt', 'rb') as file:  # don't you go correcting me on naming it file. we don't call file directly anyway!
        file.seek(position)  # zero in base case
        for line in file:
            position += len(line)
            # process the line
except:
    # yes, a naked exception. TWO faux pas in one answer?!?
    print('exception occurred at position {}'.format(position))
    raise # re-raise to see traceback or what have you
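For the command-line setting the question asks about, the same running-total idea can be sketched with awk. This is an illustration rather than a tested tool: the function name linepos_sum is made up here, and it assumes every line ends in a single "\n" (a file with "\r\n" endings would still count correctly, since awk keeps the "\r" in the record, but a missing final newline is not handled specially).

```shell
#!/bin/bash
# Hypothetical helper: prints the 0-based byte offset where line N of FILE
# starts, by summing the lengths of all the lines before it.
linepos_sum() {
    local file=$1 n=$2
    # LC_ALL=C makes length() count bytes rather than multibyte characters.
    # The "+ 1" accounts for the "\n" that awk strips from each record.
    LC_ALL=C awk -v n="$n" '
        BEGIN   { pos = 0 }
        NR == n { print pos; exit }
                { pos += length($0) + 1 }
    ' "$file"
}

# Demonstration: the first line is "h" plus a 2-byte UTF-8 "é" plus a newline.
tmp=$(mktemp)
printf 'h\303\251\nxyz\n' > "$tmp"
linepos_sum "$tmp" 2   # line 2 starts at byte offset 4
rm -f "$tmp"
```

Forcing the C locale is the shell counterpart of the encoding caveat above: without it, some awk implementations count the "é" as one character and the offset comes out one byte short.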

