NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on each sentence. It does a pretty good job out of the box.
>>> nltk.word_tokenize("(Dr. Edwards is my friend.)")
['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')']
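Under the hood this is roughly equivalent to running the sentence tokenizer and then the Treebank word tokenizer over each sentence; a simplified sketch (not the exact library code):

import nltk
from nltk.tokenize import TreebankWordTokenizer

def word_tokenize_sketch(text):
    # Split into sentences first, then word-tokenize each sentence
    # and flatten the result, as nltk.word_tokenize does.
    return [token
            for sentence in nltk.sent_tokenize(text)
            for token in TreebankWordTokenizer().tokenize(sentence)]

print(word_tokenize_sketch("(Dr. Edwards is my friend.)"))
# ['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')'], as above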
I'd like to use this same algorithm, except have it return tuples of offsets into the original string instead of string tokens.
By offset I mean 2-tuples that can serve as indices into the original string. For example, here I'd have
>>> s = "(Dr. Edwards is my friend.)"
>>> s.token_spans()
[(0,1), (1,4), (5,12), (13,15), (16,18), (19,25), (25,26), (26,27)]
because s[0:1] is "(", s[1:4] is "Dr." and so forth.
Is there a single NLTK call that does this, or do I have to write my own offset arithmetic?
Yes, most tokenizers in NLTK have a method called span_tokenize, but unfortunately the tokenizer you are using doesn't.

By default the word_tokenize function uses a TreebankWordTokenizer. The TreebankWordTokenizer implementation is fairly robust, but it currently lacks one important method: span_tokenize. Since I see no implementation of span_tokenize for TreebankWordTokenizer, I believe you will need to implement your own. Subclassing TokenizerI can make this process a little less complex.

You might find the span_tokenize method of PunktWordTokenizer useful as a starting point.

I hope this info helps.
pytokenizations has a useful function, get_original_spans, for getting the spans. See its documentation for other useful functions.
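A minimal sketch, assuming the package is installed as pytokenizations and imported as tokenizations; get_original_spans(tokens, original_text) aligns a token list against the original text and returns (start, end) pairs, with None where a token can't be aligned:

import nltk
import tokenizations  # pip install pytokenizations

text = "(Dr. Edwards is my friend.)"
tokens = nltk.word_tokenize(text)
print(tokenizations.get_original_spans(tokens, text))
# expected: [(0, 1), (1, 4), (5, 12), (13, 15), (16, 18), (19, 25), (25, 26), (26, 27)]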
At least since NLTK 3.4, TreebankWordTokenizer supports span_tokenize:
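For example, a minimal sketch on the question's string (the output shown is what I'd expect with a recent NLTK):

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = "(Dr. Edwards is my friend.)"
>>> list(TreebankWordTokenizer().span_tokenize(s))
[(0, 1), (1, 4), (5, 12), (13, 15), (16, 18), (19, 25), (25, 26), (26, 27)]

Note that TreebankWordTokenizer works on a single sentence, so for multi-sentence text you would still run sent_tokenize first and offset each sentence's spans yourself.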