How to jump to a particular line in a huge text file

Published 2019-01-01 02:06

Are there any alternatives to the code below:

startFromLine = 141978 # or whatever line I need to jump to

urlsfile = open(filename, "rb", 0)

linesCounter = 1

for line in urlsfile:
    if linesCounter >= startFromLine:  # >= so line startFromLine itself is handled
        DoSomethingWithThisLine(line)

    linesCounter += 1

I'm processing a huge text file (~15 MB) with lines of unknown and varying length, and I need to jump to a particular line whose number I know in advance. I feel bad about processing the file line by line when I know I could skip at least the first half of it. Is there a more elegant solution?
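For what it's worth, the skip-then-process loop can be written more compactly with `itertools.islice`. It still reads every line up to the start point, so it is a tidier version of the code above rather than a true jump (the function name here is mine):

```python
from itertools import islice

def process_from(path, start_from_line, handle):
    """Skip the first start_from_line lines, then pass the rest to handle."""
    with open(path, "rb") as f:
        # islice consumes (and discards) the first start_from_line lines,
        # then yields the remaining lines one by one.
        for line in islice(f, start_from_line, None):
            handle(line)
```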

15 answers
人间绝色
#2 · 2019-01-01 02:22

What generates the file you want to process? If it is something under your control, you could generate an index (which line is at which position) at the time the file is appended to. The index file can have a fixed line size (space-padded or zero-padded numbers) and will certainly be smaller, so it can be read and processed quickly.

  • Decide which line you want.
  • Calculate the byte offset of the corresponding line number in the index file (possible because the index file's line size is constant).
  • Use seek (or similar) to jump directly to that line of the index file.
  • Parse it to get the byte offset of the corresponding line in the actual file.
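A minimal sketch of that scheme; the file layout, the 10-digit zero-padded width, and the function names are my assumptions:

```python
RECORD = 11  # 10 zero-padded digits plus a newline per index entry

def build_index(data_path, index_path):
    """Record the byte offset of every line of data_path, one fixed-width entry per line."""
    with open(data_path, "rb") as data, open(index_path, "w") as idx:
        offset = 0
        for line in data:
            idx.write("%010d\n" % offset)
            offset += len(line)

def get_line(data_path, index_path, n):
    """Return line n (0-based) of data_path via the index, reading only two small chunks."""
    with open(index_path, "rb") as idx:
        idx.seek(n * RECORD)          # constant record size makes this a direct jump
        offset = int(idx.read(RECORD))
    with open(data_path, "rb") as data:
        data.seek(offset)
        return data.readline()
```

Appending stays cheap, too: extending the index only needs the previous end offset of the data file.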
零度萤火
#3 · 2019-01-01 02:25

I have had the same problem (needing to retrieve a specific line from a huge file).

Of course I can run through all the records in the file every time and stop when the counter equals the target line, but that is not effective when I want to obtain several specific rows. That left the main issue to be resolved: how to seek directly to the necessary place in the file.

I found the following solution: first I populate a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).

t = open(file, 'rb')  # binary mode, so len(each) counts bytes and matches seek offsets
dict_pos = {}

kolvo = 0   # line counter
length = 0  # running byte offset
for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1

Ultimately, the lookup function:

def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line

t.seek(dict_pos.get(line_number)) positions the file right at the start of the wanted line, so the readline() that follows returns your target line.

Using this approach I have saved a significant amount of time.

流年柔荑漫光年
#4 · 2019-01-01 02:25

You may use mmap to find the offset of a line; mmap is often one of the fastest ways to scan through a file.

example:

import mmap

with open('input_file', 'rb') as f:
    # prot=mmap.PROT_READ is Unix-only; on Windows use access=mmap.ACCESS_READ instead
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    i = 1
    offset = 0
    for line in iter(mapped.readline, b""):  # mmap.readline returns bytes, so the sentinel must be b""
        if i == Line_I_want_to_jump:
            offset = mapped.tell() - len(line)  # tell() is already past the line, so back up to its start
            break
        i += 1

then use mapped.seek(offset) (or f.seek(offset)) to move to the line you need

荒废的爱情
#5 · 2019-01-01 02:26

You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:

# Read the file once and build a list of line offsets
# (open the file in binary mode so len(line) counts bytes)
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
一个人的天荒地老
#6 · 2019-01-01 02:26

Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.
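A sketch of that binary search, assuming every line looks like `<index>:data` with strictly increasing integer indexes (the function name and the exact format are my assumptions):

```python
def find_line(f, target):
    """Binary-search an open binary file whose lines look like b'<index>:data'
    with strictly increasing indexes; return the matching line or None."""
    f.seek(0, 2)                      # whence=2: measure the file size
    lo, hi = 0, f.tell()              # invariant: lo is always the start of a line
    while hi - lo > 1:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()                  # discard the partial line mid landed in
        pos = f.tell()
        line = f.readline()           # first complete line after mid
        if line and int(line.split(b":", 1)[0]) <= target:
            lo = pos                  # target is at pos or later
        else:
            hi = mid                  # target (if present) starts at or before mid
    f.seek(lo)
    # short linear scan from the last known line start
    for line in iter(f.readline, b""):
        idx = int(line.split(b":", 1)[0])
        if idx == target:
            return line
        if idx > target:
            break
    return None
```

This touches only O(log n) lines instead of reading the whole file, which is the payoff of embedding the index in the data.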

Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().

怪性笑人.
#7 · 2019-01-01 02:29

linecache:

The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...
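Despite the "Python source file" wording, linecache works on any readable text file. A small sketch (the file name is hypothetical):

```python
import linecache

# linecache.getline is 1-based and simply returns '' for a missing
# file or an out-of-range line number -- it never raises.
line = linecache.getline('huge.txt', 141978)

# linecache keeps whole files cached in memory; clearcache() releases them,
# and checkcache() revalidates entries if files may have changed on disk.
linecache.clearcache()
```

Note that the first getline() call still reads the entire file into the cache, so this trades memory for fast repeated lookups.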
