How to get line count cheaply in Python?-第4页回答

I need to get a line count of a large file (hundreds of thousands of lines) in python. What is the most efficient way both memory- and time-wise?

At the moment I do:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

is it possible to do any better?

标签： python text-files line-count

30条回答

千与千寻千般痛.

2楼-- · 2018-12-31 03:54

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound, best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.

0人赞添加讨论(0) 举报

旧时光的记忆

3楼-- · 2018-12-31 03:54

the result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

this is more concise than your explicit loop, and avoids the enumerate.

0人赞添加讨论(0) 举报

只若初见

4楼-- · 2018-12-31 03:54

I have modified the buffer case like this:

def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines
    finally:
        f.close()

Now also empty files and the last line (without \n) are counted.

0人赞添加讨论(0) 举报

骚的不知所云

5楼-- · 2018-12-31 03:55

I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46

0人赞添加讨论(0) 举报

刘海飞了

6楼-- · 2018-12-31 03:55

def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count

0人赞添加讨论(0) 举报

墨雨无痕

7楼-- · 2018-12-31 03:56

You could execute a subprocess and run wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])

0人赞添加讨论(0) 举报

How to get line count cheaply in Python?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间