I need to get a line count of a large file (hundreds of thousands of lines) in Python. What is the most efficient way, both memory- and time-wise?
At the moment I do:
    def file_len(fname):
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1
Is it possible to do any better?
I would use Python's file object method readlines, as follows:
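A minimal sketch of that approach (the filename "myfile.txt" is a placeholder):

    with open("myfile.txt") as f:       # "myfile.txt" is a placeholder
        num_lines = len(f.readlines())  # list of lines -> length -> variable
    print(num_lines)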
This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable, and closes the file again.
I got a small (4-8%) improvement with this version, which re-uses a constant buffer, so it should avoid any memory or GC overhead:
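One way to sketch that idea is to read into a single pre-allocated bytearray with readinto(), so the same buffer is filled on every pass (the function name and default buffer size are assumptions):

    def file_len(fname, buf_size=1024 * 1024):
        # One reusable buffer: readinto() fills it in place, so no new
        # objects are allocated per read and GC pressure stays low.
        buf = bytearray(buf_size)
        lines = 0
        with open(fname, "rb") as f:
            n = f.readinto(buf)
            while n:
                # Count only within the bytes actually read on this pass.
                lines += buf.count(b"\n", 0, n)
                n = f.readinto(buf)
        return lines

Note that this counts newline bytes, so a file whose last line lacks a trailing "\n" reports one fewer than the enumerate version.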
You can play around with the buffer size and maybe see a little improvement.
Another possibility:
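One such possibility (a sketch, not necessarily this answer's original code) is to delegate the counting to the external wc -l utility, which only works on Unix-like systems:

    import subprocess

    def file_len(fname):
        # `wc -l` prints "<count> <name>"; take the first field.
        out = subprocess.check_output(["wc", "-l", fname])
        return int(out.split()[0])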
Here is a Python program that uses the multiprocessing library to distribute the line counting across machines/cores. In my test it improved counting a 20-million-line file from 26 seconds to 7 seconds on an 8-core, 64-bit Windows server. Note: not using memory mapping makes things much slower.
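A condensed sketch of that approach (the original program is longer; the worker count, chunk splitting, and function names here are illustrative). Each worker memory-maps the file and counts the newlines in its own byte range:

    import mmap
    import multiprocessing
    import os
    import sys

    def _count_chunk(args):
        # Each worker memory-maps the file and counts the newlines
        # that fall inside its assigned byte range.
        fname, start, end = args
        with open(fname, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                # Slicing copies this worker's range into memory.
                return mm[start:end].count(b"\n")
            finally:
                mm.close()

    def file_len_parallel(fname, workers=None):
        workers = workers or multiprocessing.cpu_count()
        size = os.path.getsize(fname)
        if size == 0:
            return 0
        chunk = -(-size // workers)  # ceiling division
        ranges = [(fname, lo, min(lo + chunk, size))
                  for lo in range(0, size, chunk)]
        with multiprocessing.Pool(workers) as pool:
            return sum(pool.map(_count_chunk, ranges))

    if __name__ == "__main__":  # required for multiprocessing on Windows
        print(file_len_parallel(sys.argv[1]))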
One-line solution:
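Presumably a generator expression along these lines (the filename is a placeholder):

    num_lines = sum(1 for _ in open("myfile.txt"))

Wrapping the open() in a with block is tidier, since it guarantees the file handle is closed rather than leaving that to the garbage collector.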
My snippet:
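The original snippet is missing here, so this is only a guess at a compact version in the same spirit: read fixed-size binary chunks and count the newline bytes in each (the function name and chunk size are assumptions):

    def count_lines(path, chunk_size=64 * 1024):
        # iter() with a b"" sentinel stops once read() hits end-of-file.
        with open(path, "rb") as f:
            return sum(chunk.count(b"\n")
                       for chunk in iter(lambda: f.read(chunk_size), b""))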