Given the input:
x = ['foo bar', 'bar blah', 'black sheep']
I could do this to get the count of each word in the list of space-delimited strings:
from itertools import chain
from collections import Counter
c = Counter(chain(*map(str.split, x)))
Or I could simply iterate through and get:
c = Counter()
for sent in x:
    for word in sent.split():
        c[word] += 1
[out]:
Counter({'bar': 2, 'sheep': 1, 'blah': 1, 'foo': 1, 'black': 1})
The question is: which is more efficient if the input list of strings is extremely huge? Are there other ways to achieve the same Counter object?
Imagine it's a text file object that has billions of lines with 10-20 words each.
Assuming you are on Python 3.x, both
chain(*map(str.split, x))
and plain iteration produce the same counts, but they are not equivalent memory-wise. The star-unpacking in chain(*...) consumes the entire map before chain is even called, so every per-line list exists in memory at once; the explicit loop only ever holds one line's list at a time, and chain.from_iterable(map(str.split, x)) gives the same laziness in a single expression (see the sketch below). Raw speed should be very close either way and may be implementation-dependent. The most memory-efficient approach is to feed Counter() from a generator function. Note that str.split() itself builds an intermediate list per line, which is not strictly necessary; that could cause a slowdown on a particularly long line, but to be honest it's unlikely.
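A minimal sketch of the lazy one-liner, reusing the question's sample x:

from collections import Counter
from itertools import chain

x = ['foo bar', 'bar blah', 'black sheep']

# chain.from_iterable pulls one split list at a time out of the map,
# so only a single line's word list is alive at any moment.
c = Counter(chain.from_iterable(map(str.split, x)))
print(c)  # bar counted twice, every other word once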
Such a generator function is sketched below; note that I am using optional type hints for clarity.
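One possible version; re.finditer as the lazy word scanner and 'huge.txt' as the file path are assumptions for illustration:

import re
from collections import Counter
from typing import Iterable, Iterator

def words(lines: Iterable[str]) -> Iterator[str]:
    """Lazily yield each whitespace-delimited word of each line."""
    for line in lines:
        # re.finditer scans the line lazily, so unlike str.split()
        # no per-line list of words is ever materialized.
        for match in re.finditer(r'\S+', line):
            yield match.group()

# A file object iterates line by line, so at no point is more than
# one line (plus the growing Counter) held in memory.
with open('huge.txt') as f:  # placeholder file name
    c = Counter(words(f))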
The answer to your question is: profile it on data that looks like yours. Two standard-library tools worth starting with are timeit (micro-benchmarks) and cProfile (per-function profiles).
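A minimal timeit harness for the two approaches above might look like this; the repetition factor and sample data are arbitrary:

import timeit
from collections import Counter
from itertools import chain

x = ['foo bar', 'bar blah', 'black sheep'] * 100000

def with_chain():
    return Counter(chain.from_iterable(map(str.split, x)))

def with_loop():
    c = Counter()
    for sent in x:
        for word in sent.split():
            c[word] += 1
    return c

# timeit.timeit calls each function `number` times and returns the
# total elapsed time in seconds.
print('chain:', timeit.timeit(with_chain, number=10))
print('loop: ', timeit.timeit(with_loop, number=10))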