Fastest way to re-read a file in Python?

Posted 2019-08-11 15:59

I've got a file which has a list of names and their positions (start and end).

My script iterates over that file, and for each name it reads a second file of data, checking whether each line falls between those positions, and then calculates something from the matches.

At the moment it reads the whole second file (60 MB) line by line, checking whether each value lies between start and end, and it does this once for every name in the first list (approx. 5000 names). What's the fastest way to collect the data that falls between those positions instead of re-reading the whole file 5000 times?

Sample code of the second loop:

for line in file:
    value = int(line.split()[2])  # parse the position once per line
    if start <= value <= end:
        do_something_with_line(line)

EDIT: Loading the file into a list before the first loop and iterating over that improved the speed.

with open("filename.txt", 'r') as f:
    file2 = f.readlines()

for line in file:
    [...]
    for line2 in file2:
        [...]
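
A minimal sketch of that idea, pre-parsing the position column of the second file into integers once, so the inner loop compares plain ints instead of re-splitting every line (the filename and column index come from the snippets above; names and do_something_with_line are hypothetical placeholders):

# Read the 60 MB file once, parsing the third column into an int per line.
with open("filename.txt", "r") as f:
    parsed = [(int(line.split()[2]), line) for line in f]

# names: hypothetical list of (name, start, end) tuples from the first file
for name, start, end in names:
    for value, line2 in parsed:
        if start <= value <= end:
            do_something_with_line(line2)  # hypothetical handler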

3 Answers
Rolldiameter · 2019-08-11 16:53

Maybe switch your loops around? Make iterating over the file the outer loop, and iterating over the name list the inner loop.

name_and_positions = [
    ("name_a", 10, 45),
    ("name_b", 2, 500),
    ("name_c", 96, 243),
]

with open("somefile.txt") as f:
    for line in f:
        value = int(line.split()[2])
        for name, start, end in name_and_positions:
            if start <= value <= end:
                print("matched {} with {}".format(name, value))
forever°为你锁心 · 2019-08-11 16:54

You can use the mmap module to map the file into memory and then iterate over it.

Example:

import mmap

# write a simple example file
with open("hello.txt", "wb") as f:
    f.write(b"Hello Python!\n")

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()
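
Applied to the question, a minimal sketch that memory-maps the 60 MB file and scans it once (the filename, the column index, and the name list are assumptions carried over from the question and the answer above):

import mmap

# hypothetical (name, start, end) tuples from the first file
name_and_positions = [("name_a", 10, 45), ("name_b", 2, 500)]

with open("data.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for raw in iter(mm.readline, b""):  # readline returns b"" at EOF
        value = int(raw.split()[2])  # position is the third column
        for name, start, end in name_and_positions:
            if start <= value <= end:
                print(name, value)  # stand-in for the real processing
    mm.close()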
Emotional °昔 · 2019-08-11 16:56

It seems to me that your problem is not so much re-reading files as matching slices of a long list against a short list. As other answers have pointed out, you can use plain lists or memory-mapped files to speed up your program.

If you want to use specific data structures for a further speed-up, I would advise you to look into blist, specifically because it performs better than the standard Python list when slicing: the authors claim O(log n) instead of O(n).

I have measured a speedup of almost 4x on lists of ~10 MB:

import random

from blist import blist

LINE_NUMBER = 1000000


def write_files(line_length=LINE_NUMBER):
    with open('haystack.txt', 'w') as infile:
        for _ in range(line_length):
            infile.write('an example\n')

    with open('needles.txt', 'w') as infile:
        for _ in range(line_length // 100):  # integer division: range() needs an int
            first_rand = random.randint(0, line_length)
            second_rand = random.randint(first_rand, line_length)
            needle = random.choice(['an example', 'a sample'])
            infile.write('%s\t%s\t%s\n' % (needle, first_rand, second_rand))


def read_files():
    with open('haystack.txt', 'r') as infile:
        normal_list = []
        for line in infile:
            normal_list.append(line.strip())

    enhanced_list = blist(normal_list)
    return normal_list, enhanced_list


def match_over(list_structure):
    matches = 0
    total = len(list_structure)
    with open('needles.txt', 'r') as infile:
        for line in infile:
            needle, start, end = line.split('\t')
            start, end = int(start), int(end)
            if needle in list_structure[start:end]:
                matches += 1
    return float(matches) / float(total)

As measured by IPython's %time command, the blist takes 12 s whereas the plain list takes 46 s:

In [1]: import main

In [3]: main.write_files()

In [4]: !ls -lh *.txt
10M haystack.txt
233K needles.txt

In [5]: normal_list, enhanced_list = main.read_files()

In [8]: %time main.match_over(normal_list)
CPU times: user 44.9 s, sys: 1.47 s, total: 46.4 s
Wall time: 46.4 s
Out[8]: 0.005032

In [9]: %time main.match_over(enhanced_list)
CPU times: user 12.6 s, sys: 33.7 ms, total: 12.6 s
Wall time: 12.6 s
Out[9]: 0.005032