I need to get a line count of a large file (hundreds of thousands of lines) in Python. What is the most efficient way, both memory- and time-wise?
At the moment I do:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
Is it possible to do any better?
You can't get any better than that. After all, any solution will have to read the entire file, figure out how many \n characters it contains, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.
The result of opening a file is an iterator, which can be converted to a sequence, which has a length. This is more concise than your explicit loop and avoids enumerate.
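That snippet isn't included above; a minimal sketch of the idea (the function name is just for illustration) might be:

def file_len(fname):
    with open(fname) as f:
        # list() materializes the line iterator; len() then gives the count.
        return len(list(f))

This trades memory for brevity, since all of the file's lines are held in memory at once.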
I have modified the buffer case like this:
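The modified code isn't reproduced here; a sketch of what such a buffered counter might look like (the chunk size and the handling of the final line are assumptions) is:

def buf_count_lines(fname, buf_size=1 << 16):
    # Read fixed-size chunks and count newline characters in each.
    lines = 0
    last_chunk = b""
    with open(fname, "rb") as f:
        chunk = f.read(buf_size)
        while chunk:
            lines += chunk.count(b"\n")
            last_chunk = chunk
            chunk = f.read(buf_size)
    # Count a final line that lacks a trailing newline; empty files stay at 0.
    if last_chunk and not last_chunk.endswith(b"\n"):
        lines += 1
    return lines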
Now empty files and the last line (without a trailing \n) are also counted.
I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).
All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)
Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:
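The code itself isn't reproduced above; a sketch of a raw-interface counter along the lines described (the buffer size and function name are assumptions) would be:

def rawcount(filename, buf_size=1024 * 1024):
    # Open unbuffered so we get the raw FileIO object, then do our own
    # buffering by reading into a single reusable bytearray.
    lines = 0
    buf = bytearray(buf_size)
    with open(filename, 'rb', buffering=0) as f:
        n = f.readinto(buf)
        while n:
            lines += buf[:n].count(b'\n')
            n = f.readinto(buf)
    return lines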
Using a separate generator function, this runs a smidge faster:
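Again, the original snippet is missing; a sketch with a separate generator function (names are illustrative) might look like:

def _make_gen(reader, buf_size=1024 * 1024):
    # Yield successive chunks from the raw reader until EOF.
    b = reader(buf_size)
    while b:
        yield b
        b = reader(buf_size)

def rawgencount(filename):
    with open(filename, 'rb') as f:
        # f.raw is the underlying unbuffered FileIO object.
        return sum(buf.count(b'\n') for buf in _make_gen(f.raw.read))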
This can be done completely with generator expressions in-line using itertools, but it gets pretty weird looking:
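A sketch of that itertools version (the same caveat about assumed names applies):

from itertools import repeat, takewhile

def rawincount(filename):
    with open(filename, 'rb') as f:
        # Keep calling f.raw.read until it returns an empty bytes object.
        bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)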
Here are my timings:
You could execute a subprocess and run wc -l filename.
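A sketch of that approach with the standard subprocess module (error handling beyond check=True is omitted):

import subprocess

def wc_line_count(fname):
    # wc -l prints "<count> <filename>"; take the first field.
    result = subprocess.run(['wc', '-l', fname], capture_output=True, check=True)
    return int(result.stdout.split()[0])

Note that wc -l counts newline characters, so a final line without a trailing newline is not included.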