I was going through this question. I'm just wondering whether NLTK would be faster than regex in word/sentence tokenization.
The default `nltk.word_tokenize()` uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank.

Do note that `str.split()` doesn't produce tokens in the linguistic sense. It is usually used to separate strings by a specified delimiter, e.g. `str.split('\t')` for a tab-separated file, or splitting on the newline `\n` when your text file has one sentence per line.

Now let's do some benchmarking in `python3`:
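A minimal sketch of such a benchmark, assuming a plain-text corpus in `input.txt` (the file name is a placeholder):

```python
import re
import time

from nltk.tokenize import word_tokenize  # needs the 'punkt' model: nltk.download('punkt')

# Read the whole corpus once; 'input.txt' is a placeholder file name.
with open('input.txt') as fin:
    text = fin.read()

# str.split(): plain whitespace splitting, not linguistic tokenization.
start = time.time()
tokens_split = text.split()
print('str.split():', time.time() - start)

# A pre-compiled regex that keeps words and peels off punctuation.
pattern = re.compile(r'\w+|[^\w\s]')
start = time.time()
tokens_regex = pattern.findall(text)
print('re.findall():', time.time() - start)

# NLTK's default Treebank-style word tokenizer.
start = time.time()
tokens_nltk = word_tokenize(text)
print('nltk.word_tokenize():', time.time() - start)
```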
If we try another tokenizer in bleeding-edge NLTK, ported from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:
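A minimal usage sketch, assuming an NLTK version recent enough to ship `nltk.tokenize.toktok` and the same placeholder `input.txt`:

```python
import time

from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()

with open('input.txt') as fin:
    lines = fin.readlines()

# ToktokTokenizer expects one sentence per line.
start = time.time()
tokenized = [toktok.tokenize(line) for line in lines]
print('ToktokTokenizer:', time.time() - start)
```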
(Note: the source of the text file is from https://github.com/Simdiva/DSL-Task)
If we look at the native `perl` implementation, the `python` vs `perl` times for the `ToktokTokenizer` are comparable. But do note that in the Python implementation the regexes are pre-compiled, while in Perl they aren't; still, the proof is in the pudding:
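A sketch of how the Perl timing might be run, assuming the script reads standard input and writes to standard output (file names are placeholders):

```
$ time perl tok-tok.pl < input.txt > tokenized.txt
```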
(Note: when timing `tok-tok.pl`, we had to pipe the output into a file, so the timing here includes the time the machine takes to write the output to a file, whereas the `nltk.tokenize.ToktokTokenizer` timing doesn't include the time to write to a file.)

With regards to `sent_tokenize()`, it's a little different, and comparing speed benchmarks without considering accuracy is a little quirky. Consider this:
- If a regex splits a text file/paragraph into 1 sentence, then the speed is almost instantaneous, i.e. 0 work done. But that would be a horrible sentence tokenizer...
- If the sentences in a file are already separated by `\n`, then that is simply a case of comparing `str.split('\n')` vs `re.split('\n')`, and `nltk` would have nothing to do with the sentence tokenization ;P

For information on how `sent_tokenize()` works in NLTK, see the implementation of the pre-trained Punkt sentence tokenizer (`nltk.tokenize.punkt`), which it uses under the hood.
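As a quick usage sketch (the sample string is made up; `sent_tokenize()` needs the `punkt` model downloaded):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # fetch the pre-trained Punkt model once

text = "This is a sentence. Here's another one! Is this a third?"
print(sent_tokenize(text))
# ['This is a sentence.', "Here's another one!", 'Is this a third?']
```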
So to effectively compare `sent_tokenize()` vs other regex-based methods (not `str.split('\n')`), one would also have to evaluate the accuracy, and have a dataset with human-evaluated sentences in a tokenized format.

Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences
Given the text:
We want to get this:
So simply doing `str.split('\n')` will give you nothing. Even without considering the order of the sentences, you will yield 0 positive results:
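A minimal sketch of why, with made-up stand-ins for the task's paragraph and gold sentences:

```python
# Made-up stand-ins for the task's input paragraph and gold sentences.
paragraph = "The first sentence. The second sentence! And a third one?"
gold_sentences = [
    "The first sentence.",
    "The second sentence!",
    "And a third one?",
]

# There are no newlines inside the paragraph, so str.split('\n')
# returns the whole paragraph as a single "sentence".
candidates = paragraph.split('\n')
print(candidates)

# None of the candidates match any gold sentence: 0 positive results.
positives = sum(1 for c in candidates if c in gold_sentences)
print(positives)  # 0
```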