I have a really simple script right now that counts lines in a text file using enumerate():
i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()
This takes around three and a half minutes to go through a 15 GB log file with ~30 million lines. It would be great if I could get this under two minutes, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15 GB each - possibly more than an hour and a half in total - and we'd like to minimise the time and memory load on the server.
I would also settle for a good approximation/estimation method, but it needs to be accurate to about four significant figures.
Thank you!
mmap the file, and count up the newlines.
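This answer gives no code; a minimal sketch of the mmap approach it describes (the helper name is illustrative, not from the original answer):

```python
import mmap

def count_lines_mmap(path):
    """Count newline bytes by memory-mapping the file."""
    with open(path, "rb") as f:
        # Map the whole file read-only. Note this can fail for very
        # large files in a 32-bit process, since the mapping must fit
        # in the address space.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = 0
            while True:
                pos = mm.find(b"\n", pos)
                if pos == -1:
                    break
                count += 1
                pos += 1
            return count
```

Because the kernel pages the mapping in on demand, this avoids copying the whole 15 GB file through Python-level buffers.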
I'd extend gl's answer and run his/her code using the multiprocessing Python module for a faster count:
This improves counting performance roughly 20-fold. I wrapped it in a script and put it on GitHub.
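The script referenced here isn't reproduced in the answer; a sketch of how block-wise newline counting can be parallelised with multiprocessing (function names and the chunk size are illustrative assumptions):

```python
import multiprocessing
import os

def _count_newlines(args):
    """Worker: count b'\n' in one byte range of the file."""
    path, start, size = args
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(size).count(b"\n")

def parallel_line_count(path, chunk_size=1024 * 1024):
    """Split the file into disjoint byte ranges and count in parallel."""
    file_size = os.path.getsize(path)
    tasks = [(path, offset, chunk_size)
             for offset in range(0, file_size, chunk_size)]
    with multiprocessing.Pool() as pool:
        return sum(pool.map(_count_newlines, tasks))
```

Each worker opens its own file handle and reads only its byte range, so no data is shipped between processes - only the per-chunk counts.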
I know it's a bit unfair, but you could just shell out to an external tool. If you're on Windows, Coreutils provides the same utilities.
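The command itself was lost from this answer; presumably it delegates to `wc -l` (which the Windows Coreutils port also provides). A sketch of that, with the wrapper name being my own:

```python
import subprocess

def wc_line_count(path):
    """Delegate line counting to the external `wc` utility."""
    # Requires `wc` (coreutils) on PATH; check=True raises if it fails.
    out = subprocess.run(["wc", "-l", path],
                         capture_output=True, text=True, check=True)
    # wc prints e.g. "30000000 /path/to/file.log"; take the first field.
    return int(out.stdout.split()[0])
```

`wc` is a tuned C program reading in large blocks, so it is usually far faster than a pure-Python loop over lines.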
Ignacio's answer is correct, but might fail if you have a 32-bit process.
But maybe it could be useful to read the file block-wise and then count the \n characters in each block.
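The code block did not survive extraction here; a sketch of the block-wise counting just described, opening the file in text mode to match the note that follows (the function name and buffer size are illustrative):

```python
def count_newlines_blockwise(path, block_size=65536):
    """Count '\n' by reading the file in fixed-size blocks."""
    count = 0
    # Text mode: universal newlines translate \r\n to \n before we
    # count, and the decoder handles \r\n pairs split across blocks.
    with open(path) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count("\n")
    return count
```

Memory use stays bounded by the block size regardless of file size, so this works even where mmap would exhaust a 32-bit address space.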
Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.
For Python 3, and to make it more robust, for reading files with all kinds of characters:
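The Python 3 variant this sentence introduces is also missing; a sketch that tolerates undecodable bytes by ignoring encoding errors (the function name and UTF-8 assumption are mine):

```python
def count_lines_py3(path, block_size=65536):
    """Block-wise newline count that survives bad bytes (Python 3)."""
    count = 0
    # errors="ignore" drops bytes that are invalid in the file's
    # encoding, so one corrupt byte can't abort the whole count.
    with open(path, encoding="utf-8", errors="ignore") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count("\n")
    return count
```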
A fast, one-line solution is:
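The one-liner itself is missing from this answer; the usual form sums over the file's line iterator, shown here wrapped in a function (the name is mine) so the path is a parameter:

```python
def quick_line_count(path):
    # Equivalent to the one-liner: sum(1 for line in open(path)).
    # The generator iterates lazily, so memory use stays constant
    # no matter how large the file is.
    return sum(1 for _ in open(path))
```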
It should work on files of arbitrary size.