I've got a file with a list of names and their positions (start and end).
My script iterates over that file and, for each name, reads another file with info to check whether each line falls between those positions, then calculates something from that.
At the moment it reads the whole second file (60 MB) line by line, checking whether each line is between the start and end, and it does this for every name in the first list (approx. 5000). What's the fastest way to collect the data that falls between those parameters instead of rereading the whole file 5000 times?
Sample code of the second loop:
for line in file:
    if start <= int(line.split()[2]) <= end:
        Dosomethingwithline()
EDIT: Loading the file in a list above the first loop and iterating over that improved the speed.
with open("filename.txt", 'r') as f:
    file2 = f.readlines()

for line in file:
    [...]
    for line2 in file2:
        [...]
Maybe switch your loops around? Make iterating over the file the outer loop, and iterating over the name list the inner loop.
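A rough sketch of that inversion (the name-list file, its format, and do_something are assumptions for illustration, not the asker's actual code):

# Read the small name/range list once, as (name, start, end) tuples.
ranges = []
with open("names.txt") as f:            # hypothetical name-list file
    for entry in f:
        name, start, end = entry.split()
        ranges.append((name, int(start), int(end)))

# Outer loop: a single pass over the big 60 MB file.
with open("filename.txt") as f:
    for line in f:
        pos = int(line.split()[2])
        # Inner loop: check the ~5000 (name, start, end) ranges.
        for name, start, end in ranges:
            if start <= pos <= end:
                do_something(name, line)   # placeholder for the per-name calculation

This way the big file is read from disk only once, at the cost of checking every range for each of its lines.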
You can use the mmap module to load that file into memory, then iterate.
Example:
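(A minimal sketch of this approach; the file name, column index, start/end values, and Dosomethingwithline are taken from the question and stand in for the real code.)

import mmap

with open("filename.txt", "rb") as f:
    # Map the whole file into memory (length 0 means "the entire file").
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Lines from an mmap are bytes; int() accepts them directly.
    for line in iter(mm.readline, b""):
        if start <= int(line.split()[2]) <= end:
            Dosomethingwithline()
    mm.close()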
It seems to me that your problem is not so much re-reading files, but matching slices of a long list with a short list. As other answers have pointed out, you can use plain lists or memory-mapped files to speed up your program.
If you want to use specific data structures for a further speed-up, I would advise you to look into blist, specifically because it has better slicing performance than the standard Python list: they claim O(log n) instead of O(n).
I have measured a speedup of almost 4x on lists of ~10 MB: as measured by IPython's %time command, the blist takes 12 s where the plain list takes 46 s.
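For illustration, a blist can be dropped in wherever the big list is built; this is only a sketch (the file name and slice bounds are arbitrary, and blist is a third-party package, pip install blist):

from blist import blist   # third-party package: pip install blist

# Build the big list once with blist instead of the built-in list.
with open("filename.txt") as f:
    file2 = blist(f.readlines())

# Slices of a blist are claimed to be O(log n) rather than O(n),
# so repeatedly taking sub-ranges of the big list is cheaper.
chunk = file2[1000:5000]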