I have a really simple script right now that counts lines in a text file using enumerate():
i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()
This takes around three and a half minutes to go through a 15 GB log file with ~30 million lines. It would be great if I could get this under two minutes, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15 GB each - possibly more than an hour and a half in total - and we'd like to minimise the time and memory load on the server.
I would also settle for a good approximation/estimation method, but it needs to be accurate to about four significant figures.
Thank you!
mmap the file, and count up the newlines.
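This answer gives no code; a minimal sketch of the mmap approach it describes (the helper name is illustrative, not from the original answer):

```python
import mmap

def count_lines_mmap(path):
    """Count newline bytes by memory-mapping the file."""
    with open(path, "rb") as f:
        # Map the whole file read-only. Note this can fail for very
        # large files in a 32-bit process, since the mapping must fit
        # in the address space.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = 0
            while True:
                pos = mm.find(b"\n", pos)
                if pos == -1:
                    break
                count += 1
                pos += 1
            return count
```

Because the kernel pages the mapping in on demand, this avoids copying the whole 15 GB file through Python-level buffers.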
I'd extend gl's answer and run his/her code using the multiprocessing Python module for a faster count:
This improves counting performance roughly 20-fold. I wrapped it in a script and put it on GitHub.
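The script referenced here isn't reproduced in the answer; a sketch of how block-wise newline counting can be parallelised with multiprocessing (function names and the chunk size are illustrative assumptions):

```python
import multiprocessing
import os

def _count_newlines(args):
    """Worker: count b'\n' in one byte range of the file."""
    path, start, size = args
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(size).count(b"\n")

def parallel_line_count(path, chunk_size=1024 * 1024):
    """Split the file into disjoint byte ranges and count in parallel."""
    file_size = os.path.getsize(path)
    tasks = [(path, offset, chunk_size)
             for offset in range(0, file_size, chunk_size)]
    with multiprocessing.Pool() as pool:
        return sum(pool.map(_count_newlines, tasks))
```

Each worker opens its own file handle and reads only its byte range, so no data is shipped between processes - only the per-chunk counts.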
I know it's a bit unfair, but you could just shell out to an external tool. If you're on Windows, Coreutils provides the same utilities.
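The command itself was lost from this answer; presumably it delegates to `wc -l` (which the Windows Coreutils port also provides). A sketch of that, with the wrapper name being my own:

```python
import subprocess

def wc_line_count(path):
    """Delegate line counting to the external `wc` utility."""
    # Requires `wc` (coreutils) on PATH; check=True raises if it fails.
    out = subprocess.run(["wc", "-l", path],
                         capture_output=True, text=True, check=True)
    # wc prints e.g. "30000000 /path/to/file.log"; take the first field.
    return int(out.stdout.split()[0])
```

`wc` is a tuned C program reading in large blocks, so it is usually far faster than a pure-Python loop over lines.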
Ignacio's answer is correct, but might fail if you have a 32-bit process.
But maybe it could be useful to read the file block-wise and then count the \n characters in each block.
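The code block did not survive extraction here; a sketch of the block-wise counting just described, opening the file in text mode to match the note that follows (the function name and buffer size are illustrative):

```python
def count_newlines_blockwise(path, block_size=65536):
    """Count '\n' by reading the file in fixed-size blocks."""
    count = 0
    # Text mode: universal newlines translate \r\n to \n before we
    # count, and the decoder handles \r\n pairs split across blocks.
    with open(path) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count("\n")
    return count
```

Memory use stays bounded by the block size regardless of file size, so this works even where mmap would exhaust a 32-bit address space.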
Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.
For Python 3, and to make it more robust, for reading files with all kinds of characters:
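The Python 3 variant this sentence introduces is also missing; a sketch that tolerates undecodable bytes by ignoring encoding errors (the function name and UTF-8 assumption are mine):

```python
def count_lines_py3(path, block_size=65536):
    """Block-wise newline count that survives bad bytes (Python 3)."""
    count = 0
    # errors="ignore" drops bytes that are invalid in the file's
    # encoding, so one corrupt byte can't abort the whole count.
    with open(path, encoding="utf-8", errors="ignore") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count("\n")
    return count
```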
A fast, one-line solution is:
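The one-liner itself is missing from this answer; the usual form sums over the file's line iterator, shown here wrapped in a function (the name is mine) so the path is a parameter:

```python
def quick_line_count(path):
    # Equivalent to the one-liner: sum(1 for line in open(path)).
    # The generator iterates lazily, so memory use stays constant
    # no matter how large the file is.
    return sum(1 for _ in open(path))
```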
It should work on files of arbitrary size.