I have a really simple script right now that counts lines in a text file using enumerate():

i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()
This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this under two minutes, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB - potentially more than an hour and a half - and we'd like to minimise the time & memory load on the server.
I would also settle for a good approximation/estimation method, but it needs to be accurate to about 4 significant figures...
Thank you!
Ignacio's answer is correct, but might fail if you have a 32-bit process.
But maybe it could be useful to read the file block-wise and then count the \n characters in each block.
def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))

will do your job.
Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.
For Python 3, and to make it more robust for reading files with all kinds of characters:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r", encoding="utf-8", errors='ignore') as f:
    print(sum(bl.count("\n") for bl in blocks(f)))
I know it's a bit unfair, but you could do this:

import subprocess
int(subprocess.check_output("wc -l C:\\alarm.bat").split()[0])

If you're on Windows, use Coreutils.
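For completeness, a self-contained sketch of the same idea (the count_lines_wc name is just illustrative), passing the arguments as a list and assuming wc is on the PATH, e.g. via Coreutils on Windows:

import subprocess

def count_lines_wc(path):
    # Run `wc -l <path>` and parse the leading line count from its output.
    out = subprocess.check_output(["wc", "-l", path])
    return int(out.split()[0])

print(count_lines_wc("C:/Users/guest/Desktop/file.log"))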
mmap the file, and count up the newlines.
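This answer gives no code; here is a minimal sketch of the mmap approach, assuming a 64-bit process so mapping a 15GB file won't fail (as noted above), with count_lines_mmap as an illustrative name:

import mmap

def count_lines_mmap(path):
    with open(path, "rb") as f:
        # Map the file read-only; nothing is copied into memory up front.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            lines = 0
            pos = 0
            # Count newline bytes by repeatedly searching for b"\n".
            while True:
                pos = mm.find(b"\n", pos)
                if pos == -1:
                    break
                lines += 1
                pos += 1
            return lines
        finally:
            mm.close()

print(count_lines_mmap("C:/Users/guest/Desktop/file.log"))

Like the block-counting answers, this counts newline characters, so a final line without a trailing newline is not counted.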
A fast, one-line solution is:

sum(1 for i in open(file_path, 'rb'))
It should work on files of arbitrary size.
I'd extend gl's answer and run his/her code using the multiprocessing Python module for a faster count:
def blocks(f, cut, size=64*1024):  # 65536
    # Yield blocks from f without reading past the end of this worker's chunk.
    start, chunk = cut
    read_size = size
    _break = False
    while not _break:
        if f.tell() + size > start + chunk:
            # Last read for this chunk: only read the remaining bytes.
            read_size = start + chunk - f.tell()
            _break = True
        b = f.read(read_size)
        if not b:
            break
        yield b
def get_chunk_line_count(data):
    # Count newlines in one (start, chunk) byte range of the file.
    fn, chunk_id, cut = data
    start, chunk = cut
    cnt = 0
    last_bl = None
    with open(fn, "r") as f:
        f.seek(start)
        for bl in blocks(f, cut):
            cnt += bl.count('\n')
            last_bl = bl
        if last_bl and not last_bl.endswith('\n'):
            cnt -= 1
    return cnt
....
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process)
pool_outputs = pool.map(get_chunk_line_count, inputs)
pool.close()  # no more tasks
pool.join()
This will improve counting performance 20-fold.
I wrapped it into a script and put it on GitHub.
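The .... above elides how the inputs list is built. One possible way to split the file into per-worker byte ranges, reusing get_chunk_line_count from above; the helper names build_inputs and start_process are placeholders, since the original doesn't show them:

import multiprocessing
import os

def start_process():
    # Placeholder initializer; the original answer does not show its body.
    pass

def build_inputs(fn, pool_size):
    # Split the file into pool_size byte ranges, each described as (start, length).
    file_size = os.path.getsize(fn)
    chunk = file_size // pool_size
    inputs = []
    for chunk_id in range(pool_size):
        start = chunk_id * chunk
        # The last range absorbs any leftover bytes from the integer division.
        length = file_size - start if chunk_id == pool_size - 1 else chunk
        inputs.append((fn, chunk_id, (start, length)))
    return inputs

if __name__ == "__main__":
    fn = "C:/Users/guest/Desktop/file.log"
    pool_size = multiprocessing.cpu_count()
    inputs = build_inputs(fn, pool_size)
    pool = multiprocessing.Pool(processes=pool_size, initializer=start_process)
    pool_outputs = pool.map(get_chunk_line_count, inputs)
    pool.close()  # no more tasks
    pool.join()
    print(sum(pool_outputs))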