I have a method that takes in a string parameter and uses NLTK to break the string down into sentences, then into words. Afterwards it converts each word to lowercase, and finally creates a dictionary of the frequency of each word.
```python
import nltk
from collections import Counter

def freq(string):
    f = Counter()
    sentence_list = nltk.tokenize.sent_tokenize(string)
    for sentence in sentence_list:
        words = nltk.word_tokenize(sentence)
        words = [word.lower() for word in words]
        for word in words:
            f[word] += 1
    return f
```
I'm supposed to optimize the above code further to get a faster preprocessing time, and am unsure how to do so. The return value should obviously be exactly the same as the above, so I'm expected to use nltk, though not explicitly required to do so.

Is there any way to speed up the above code? Thanks.
If you just want a flat list of tokens, note that `word_tokenize` calls `sent_tokenize` implicitly, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L98

Using the Brown corpus as an example, with `Counter(word_tokenize(string_corpus))`: ~1.4 million words took 12 secs (without saving the tokenized corpus) on my machine with specs:
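Roughly, that one-pass setup looks like this (the corpus construction below is illustrative; the original benchmark harness isn't shown):

```python
from collections import Counter
from nltk import word_tokenize
from nltk.corpus import brown

# Rebuild the Brown corpus as a single plain-text string.
string_corpus = ' '.join(brown.words())

# One pass: tokenize the whole string and count tokens directly.
word_freqs = Counter(word_tokenize(string_corpus))
```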
Saving the tokenized corpus first with `tokenized_corpus = [word_tokenize(sent) for sent in sent_tokenize(string_corpus)]`, then using `Counter(chain(*tokenized_corpus))`:
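In code, that two-step variant looks roughly like this (same illustrative corpus setup as above; timings omitted):

```python
from collections import Counter
from itertools import chain
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import brown

string_corpus = ' '.join(brown.words())

# Keep the per-sentence token lists around so they can be reused later...
tokenized_corpus = [word_tokenize(sent) for sent in sent_tokenize(string_corpus)]

# ...and flatten them lazily while counting.
word_freqs = Counter(chain(*tokenized_corpus))
```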
Using `ToktokTokenizer()`:
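A minimal sketch of plugging in `ToktokTokenizer`; it works on one sentence (or line) at a time, so the text is sentence-split first and the token streams are flattened lazily:

```python
from collections import Counter
from itertools import chain
from nltk import sent_tokenize
from nltk.corpus import brown
from nltk.tokenize import ToktokTokenizer

string_corpus = ' '.join(brown.words())

toktok = ToktokTokenizer()
# chain.from_iterable() flattens the per-sentence token lists
# without building one big intermediate list.
word_freqs = Counter(chain.from_iterable(
    toktok.tokenize(sent) for sent in sent_tokenize(string_corpus)))
```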
Using `MosesTokenizer()`:

Why use `MosesTokenizer`? It was implemented in such a way that there is a way to reverse the tokens back into a string, i.e. to "detokenize".
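A sketch of the tokenize/detokenize round trip. Note that `MosesTokenizer` has been removed from recent NLTK releases and now ships as the separate `sacremoses` package, so the import below assumes that package is installed:

```python
# In older NLTK versions this lived in nltk.tokenize.moses; it is now in sacremoses.
from sacremoses import MosesTokenizer, MosesDetokenizer

moses = MosesTokenizer()
detok = MosesDetokenizer()

tokens = moses.tokenize("This ain't funny. It's actually hilarious.")
# detokenize() reverses the tokenization and returns a plain string.
restored = detok.detokenize(tokens)
print(restored)
```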
Using `ReppTokenizer()`:

Why use `ReppTokenizer`? It returns the offsets of the tokens in the original string.
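Roughly, and heavily hedged: `ReppTokenizer` wraps the external REPP tokenizer, so it needs a locally compiled REPP installation whose directory you pass to the constructor (the path below is a placeholder), and the `keep_token_positions` flag shown here comes from the `nltk.tokenize.repp` module and may vary between NLTK versions:

```python
from nltk.tokenize.repp import ReppTokenizer

# Placeholder: point this at your local REPP installation directory.
tokenizer = ReppTokenizer('/path/to/repp/')

sents = ["But rule-based tokenizers are hard to maintain.",
         "Tokenization is widely regarded as a solved problem."]

# With keep_token_positions=True each sentence comes back as
# (token, start, end) triples, i.e. offsets into the original string.
for sent in tokenizer.tokenize_sents(sents, keep_token_positions=True):
    print(sent)
```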
TL;DR

Advantages of the different tokenizers:

- `word_tokenize()` implicitly calls `sent_tokenize()`
- `ToktokTokenizer()` is the fastest
- `MosesTokenizer()` is able to detokenize text
- `ReppTokenizer()` is able to provide token offsets

Q: Is there a fast tokenizer in NLTK that can detokenize, provide token offsets, and also do sentence tokenization?
A: I don't think so; try `gensim` or `spacy`.

Unnecessary list creation is evil
Your code is implicitly creating a lot of potentially very long `list` instances which don't need to be there. For example, using the `[...]` syntax for a list comprehension creates a list of length n for the n tokens found in your input, but all you want to do is get the frequency of each token, not actually store the tokens. Therefore, you should use a generator instead.
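For example, a minimal reworking of the question's `freq()` along those lines; the result should match the original, since `Counter.update()` accepts any iterable:

```python
import nltk
from collections import Counter

def freq(string):
    f = Counter()
    for sentence in nltk.tokenize.sent_tokenize(string):
        # The generator expression lowercases tokens lazily; Counter.update()
        # consumes it directly, so no second list is materialised.
        f.update(word.lower() for word in nltk.word_tokenize(sentence))
    return f
```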
Similarly, `nltk.tokenize.sent_tokenize` and `nltk.tokenize.word_tokenize` both seem to produce lists as output, which is again unnecessary; try to use a more low-level function, e.g. `nltk.tokenize.api.StringTokenizer.span_tokenize`, which merely generates an iterator that yields token offsets for your input stream, i.e. pairs of indices into your input string representing each token.

A better solution
Here is an example using no intermediate lists:
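Something along these lines (a sketch, not profiled: it loads the pre-trained Punkt model that `sent_tokenize()` uses internally and relies on `TreebankWordTokenizer.span_tokenize()`, which newer NLTK releases provide; a few tokens such as quote characters may come out spelled slightly differently than with `word_tokenize()`):

```python
import nltk
from collections import Counter
from nltk.tokenize import TreebankWordTokenizer

def freq(string):
    f = Counter()
    # The pre-trained Punkt sentence tokenizer behind sent_tokenize().
    sent_tok = nltk.data.load('tokenizers/punkt/english.pickle')
    word_tok = TreebankWordTokenizer()
    # span_tokenize() yields (start, end) offset pairs lazily,
    # so no intermediate token lists are built.
    for sent_start, sent_end in sent_tok.span_tokenize(string):
        sentence = string[sent_start:sent_end]
        for start, end in word_tok.span_tokenize(sentence):
            f[sentence[start:end].lower()] += 1
    return f
```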
Disclaimer: I've not profiled this, so it's possible that e.g. the NLTK people have made `word_tokenize` blazingly fast but neglected `span_tokenize`; always profile your application to be sure.

TL;DR
Don't use lists when generators will suffice: Every time you create a list just to throw it away after using it once, God kills a kitten.