Fast/Efficient counting of words in a list of space-delimited strings

Asked 2019-07-25 11:06

Given the input:

x = ['foo bar', 'bar blah', 'black sheep']

I could do this to get the count of each word in the list of space-delimited strings:

from itertools import chain
from collections import Counter
c = Counter(chain(*map(str.split, x)))

Or I could simply iterate through and get:

c = Counter()
for sent in x:
    for word in sent.split():
        c[word] += 1

[out]:

Counter({'bar': 2, 'sheep': 1, 'blah': 1, 'foo': 1, 'black': 1})

The question is: which is more efficient if the input list of strings is extremely huge? Are there other ways to achieve the same Counter object?

Imagine it's a text file object that has billions of lines with 10-20 words each.

2 Answers
干净又极端
Answer 2 · 2019-07-25 11:55

Assuming you are on Python 3.x: the explicit loop creates one intermediate list per line and discards it before moving to the next, so it stays cheap on memory. chain(*map(str.split, x)) is less innocent than it looks, though: the * unpacking consumes the entire map up front, so every per-line list exists in memory at once before chaining even begins. The lazy equivalent is chain.from_iterable(map(str.split, x)), which pulls one list at a time. Raw speed of the two approaches should be very close and may be implementation-dependent.
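As a minimal sketch (reusing the toy x from the question), the lazy form looks like this:

from itertools import chain
from collections import Counter

x = ['foo bar', 'bar blah', 'black sheep']

# from_iterable pulls one per-line list at a time instead of
# unpacking them all as call arguments up front.
c = Counter(chain.from_iterable(map(str.split, x)))
# Counter({'bar': 2, 'foo': 1, 'blah': 1, 'black': 1, 'sheep': 1})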

However, the most memory-efficient option is to feed Counter() from a generator function. Every use of str.split() creates intermediate lists that are not strictly necessary; that could cause a slowdown on a particularly long line, though to be honest it is unlikely.

Such a generator function is described below. Note that I am using optional typing for clarity.

from collections import Counter
from typing import Iterable, Iterator

def gen_words(strings: Iterable[str]) -> Iterator[str]:
    # Slice words out of each line directly, never building a per-line list.
    for string in strings:
        start = 0
        for i, char in enumerate(string):
            if char == ' ':
                if start != i:  # skip runs of consecutive spaces
                    yield string[start:i]
                start = i + 1   # the next word begins after this space
        if start < len(string):  # emit the trailing word, if any
            yield string[start:]

c = Counter(gen_words(x))
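Because a file object is itself an iterable of lines, the same generator streams a huge file while holding only one line in memory at a time. A sketch, where 'corpus.txt' is a hypothetical file name:

# 'corpus.txt' stands in for the billions-of-lines file from the question.
# rstrip removes the trailing newline so it doesn't stick to the last word.
with open('corpus.txt') as f:
    c = Counter(gen_words(line.rstrip('\n') for line in f))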
再贱就再见
Answer 3 · 2019-07-25 11:55

The answer to your question is to profile it.

Two standard-library starting points are timeit (for timing small snippets) and cProfile (for a function-by-function breakdown).
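As an illustrative sketch (the corpus size and repetition count here are made up for demonstration), timeit can compare the two approaches from the question directly:

import timeit

setup = '''
from itertools import chain
from collections import Counter
x = ['foo bar', 'bar blah', 'black sheep'] * 100000
'''

# Approach 1: chain + map
print(timeit.timeit(
    "Counter(chain(*map(str.split, x)))", setup=setup, number=10))

# Approach 2: explicit nested loops
print(timeit.timeit(
    '''
c = Counter()
for sent in x:
    for word in sent.split():
        c[word] += 1
''',
    setup=setup, number=10))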
