How to get line count cheaply in Python?

Posted 2018-12-31 03:20

I need to get a line count of a large file (hundreds of thousands of lines) in Python. What is the most efficient way, both memory- and time-wise?

At the moment I do:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is it possible to do any better?

30 answers
弹指情弦暗扣
#2 · 2018-12-31 03:56

To round out the methods above, I tried a variant with the fileinput module:

import fileinput as fi

def filecount(fname):
    for line in fi.input(fname):
        pass
    return fi.lineno()

I then passed a 60-million-line file to all of the methods mentioned above:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

It's a little surprising to me that fileinput is that bad and scales far worse than all the other methods.
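
The bufcount implementation is not shown in this thread, so the following is only a guess at the general idea behind such a method: read the file in large binary chunks and count newline bytes, which avoids creating one string object per line. The function name `buf_count_newlines` and the 1 MiB chunk size are assumptions made for illustration.

```python
def buf_count_newlines(fname, chunk_size=1024 * 1024):
    """Count newline bytes by reading the file in fixed-size binary chunks.

    Hypothetical helper: the name and the default chunk size are
    illustrative, not taken from the thread.
    """
    count = 0
    with open(fname, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            count += chunk.count(b"\n")
    return count
```

Like `wc -l`, this counts newline characters, so a final line that lacks a trailing newline is not included in the result.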

深知你不懂我心
#3 · 2018-12-31 03:59

Why not read the first 100 and the last 100 lines, estimate the average line length, and then divide the total file size by that number? If you don't need an exact value, this could work.
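
A minimal sketch of that estimate, assuming the file is read in binary mode and that `sample_lines` is at least 1. The function name `estimate_line_count` is hypothetical, and the tail sample is taken by seeking back roughly as many bytes as the head sample occupied, so it only approximates "the last 100 lines".

```python
import os
from itertools import islice

def estimate_line_count(path, sample_lines=100):
    """Estimate the line count from the average length of sampled lines.

    Hypothetical helper; assumes sample_lines >= 1.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Sample lines from the start of the file.
        head = list(islice(f, sample_lines))
        if len(head) < sample_lines:
            # The whole file was read, so this is the exact count.
            return len(head)
        head_bytes = sum(len(line) for line in head)
        # Sample from the end: seek back about as many bytes as the head used.
        f.seek(max(0, size - head_bytes))
        tail = f.readlines()
        if len(tail) > 1:
            # The first line after an arbitrary seek is probably partial.
            tail = tail[1:]
    sample = head + tail
    avg_len = sum(len(line) for line in sample) / len(sample)
    return round(size / avg_len)
```

The estimate is only as good as the assumption that the sampled lines are representative; a file whose middle section has much longer or shorter lines will be miscounted.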

刘海飞了
#4 · 2018-12-31 04:00

If you want to get the line count cheaply in Python on Linux, I recommend this method:

import os
# Replace file_path in the command string with the path to your file.
print(os.popen("wc -l file_path").readline().split()[0])

file_path can be either an absolute or a relative path. Hope this helps.

一个人的天荒地老
#5 · 2018-12-31 04:00

def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count
明月照影归
#6 · 2018-12-31 04:02

Kyle's answer

num_lines = sum(1 for line in open('my_file.txt'))

is probably best. An alternative is

num_lines = len(open('my_file.txt').read().splitlines())

Here is a performance comparison of the two:

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
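
The IPython timings above can be reproduced in a plain script with the standard `timeit` module. The sketch below is one way to do it; the helper names, the temporary file, and its contents are made up for illustration, and the numbers will depend on file size and machine.

```python
import os
import tempfile
import timeit

def count_with_sum(path):
    """Count lines by iterating over the file object."""
    with open(path) as f:
        return sum(1 for _ in f)

def count_with_splitlines(path):
    """Count lines by reading the whole file into memory and splitting it."""
    with open(path) as f:
        return len(f.read().splitlines())

def compare(path, repeats=100):
    """Return the total seconds each approach takes for `repeats` calls."""
    return {
        "sum": timeit.timeit(lambda: count_with_sum(path), number=repeats),
        "splitlines": timeit.timeit(
            lambda: count_with_splitlines(path), number=repeats
        ),
    }

if __name__ == "__main__":
    # Illustrative input: a throwaway file of 10,000 short lines.
    with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
        tmp.write("some text\n" * 10000)
    try:
        print(compare(tmp.name))
    finally:
        os.remove(tmp.name)
```

Note that the two approaches can disagree: `str.splitlines()` also splits on characters such as form feed and vertical tab, while iterating over a text file only splits on newline sequences. The `splitlines` version also holds the entire file in memory at once.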
浮光初槿花落
#7 · 2018-12-31 04:02

Here is what I use; it seems pretty clean:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
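    # check_output returns bytes in Python 3; split() with no argument
    # works on bytes and str and tolerates leading whitespace in wc's output.
    num = subprocess.check_output(['wc', '-l', file_path])
    return int(num.split()[0])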

UPDATE: This is marginally faster than using pure Python, but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.
