I'd like to count frequencies of all words in a text file.
>>> countInFile('test.txt')
should return {'aaa':1, 'bbb': 2, 'ccc':1}
if the target text file is like:
# test.txt
aaa bbb ccc
bbb
I've implemented it in pure Python, following some posts. However, I've found that pure-Python approaches are insufficient due to the huge file size (> 1GB).
I think borrowing sklearn's power is a candidate.
If you let CountVectorizer count frequencies for each line, I guess you will get word frequencies by summing up each column. But that sounds like a rather indirect way.
What is the most efficient and straightforward way to count words in a file with python?
Update
My (very slow) code is here:
import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            # Python 2 str.translate: delete punctuation, then split on whitespace
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return {k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)}
The most succinct approach is to use the tools Python gives you.
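Something along these lines (a sketch of the one-liner this answer describes; the function name count_words is just a placeholder):

from collections import Counter
from itertools import chain

def count_words(filename):
    with open(filename) as f:
        # Split each line into words, chain the per-line lists into one lazy
        # stream of words, and let Counter tally them in a single pass.
        return Counter(chain.from_iterable(map(str.split, f)))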
That's it.

map(str.split, f) is making a generator that returns lists of words from each line. Wrapping it in chain.from_iterable converts that to a single generator that produces one word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation, you only store a line of data at a time and the running counts, not the whole file at once.

In theory, on Python 2.7 and 3.1, you might do slightly better looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.

Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:
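A sketch of such a variant, assuming Python 3 (where str.translate takes a table built with str.maketrans rather than a deletechars argument):

import string
from collections import Counter
from itertools import chain

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def count_words(filename):
    with open(filename) as f:
        # Lowercase and strip punctuation lazily, one line at a time.
        linewords = (line.lower().translate(PUNCT_TABLE).split() for line in f)
        return Counter(chain.from_iterable(linewords))

For the test.txt in the question, count_words('test.txt') counts aaa once, bbb twice and ccc once, matching the expected output.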
Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).
Skip CountVectorizer and scikit-learn.

The file may be too large to load into memory, but I doubt the Python dictionary gets too large. The easiest option for you may be to split the large file into 10-20 smaller files and extend your code to loop over them.
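A sketch of that loop, assuming the smaller files are named chunk_00.txt, chunk_01.txt, and so on (hypothetical names):

import glob
from collections import Counter

def count_words_in_chunks(pattern='chunk_*.txt'):
    total = Counter()
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                # Fold each chunk's words into one running total.
                total.update(line.split())
    return total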
A memory efficient and accurate way is to make use of

- scikit (for ngram extraction)
- word_tokenize
- numpy matrix sum to collect the counts
- collections.Counter for collecting the counts and vocabulary

An example:
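A sketch of how those pieces fit together (the function name freq_dist is assumed here; the rest of this answer refers to it and passes it an iterable of lines):

from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize

def freq_dist(data):
    # data: an iterable of lines/sentences (e.g. a list of strings).
    vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize,
                                 ngram_range=(1, 1), min_df=1)
    X = vectorizer.fit_transform(data)           # sparse document-term matrix
    vocab = vectorizer.get_feature_names_out()   # get_feature_names() on older sklearn
    counts = np.asarray(X.sum(axis=0)).ravel()   # column sums = per-term totals
    return Counter(dict(zip(vocab, counts)))

For the question's file, freq_dist(['aaa bbb ccc', 'bbb']) counts aaa once, bbb twice and ccc once.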
[out]:
Essentially, you can also do this:
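One way that could look, reusing the same vectorizer pipeline but with CountVectorizer's default tokenizer instead of word_tokenize (a sketch, not necessarily the original snippet):

from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
    X = vectorizer.fit_transform(data)
    vocab = vectorizer.get_feature_names_out()
    counts = np.asarray(X.sum(axis=0)).ravel()
    return Counter(dict(zip(vocab, counts)))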
Let's timeit:
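A sketch of the kind of timing harness that fits here (the sample text and repeat count are placeholders, and freq_dist refers to the sketch above):

import timeit

# Placeholder sample data; the original benchmark input isn't shown here.
lines = ('aaa bbb ccc\nbbb\n' * 100000).splitlines()

print(timeit.timeit(lambda: freq_dist(lines), number=10))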
[out]:
Note that CountVectorizer can also take a file instead of a string, and there's no need to read the whole file into memory. In code:
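A sketch of that, passing the open file handle straight to fit_transform so sklearn consumes it one line at a time (the path is a placeholder):

import io
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'   # placeholder path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
with io.open(infile, 'r', encoding='utf8') as fin:
    # fit_transform accepts any iterable of documents; an open text file
    # iterates line by line, so the raw text is never held in memory at once.
    X = vectorizer.fit_transform(fin)
    vocab = vectorizer.get_feature_names_out()
    counts = np.asarray(X.sum(axis=0)).ravel()
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print(freq_distribution.most_common(10))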
Instead of decoding the whole bytes read from the URL, I process the binary data. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuation, I utf-8 decode the byte string.

The function freq_dist expects an iterable. That's why I've passed data.splitlines().
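A sketch of that flow (the URL is a placeholder, and freq_dist here is a stand-in line-counting helper, since the answer's original code isn't shown):

import string
from collections import Counter
from urllib.request import urlopen

def freq_dist(lines):
    # Stand-in counting helper: takes an iterable of lines.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

url = 'http://example.com/big.txt'   # placeholder URL
raw = urlopen(url).read()            # bytes, not str
# bytes.translate wants bytes for its delete argument, so encode punctuation.
raw = raw.translate(None, string.punctuation.encode('utf-8'))
data = raw.decode('utf-8')
print(freq_dist(data.splitlines()).most_common(10))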
Output;
It seems dict is more efficient than a Counter object.

Output;
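For reference, the dict-based variant being compared might look like this (a sketch, not the original code):

def freq_dist_dict(lines):
    # Same counting loop, but accumulating into a plain dict instead of a Counter.
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts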
To be more memory efficient when opening a huge file, you have to pass just the opened URL. But the timing will then include the file download time too.
This should suffice.
Here's a benchmark. It'll look strange, but the crudest code wins.
[code]:
[out]:
Data size (154MB) used in the benchmark above:
Some things to note:

- In the sklearn version, there's an overhead of vectorizer creation + numpy manipulation and conversion into a Counter object
- In the Counter update version, it seems like Counter.update() is an expensive operation