I am tokenizing a text using nltk.word_tokenize and I would also like to get the index in the original raw text of the first character of every token, i.e.
import nltk
x = 'hello world'
tokens = nltk.word_tokenize(x)
>>> ['hello', 'world']
How can I also get the array [0, 6], corresponding to the indices in the raw text where each token starts?
I think what you are looking for is the span_tokenize() method. Apparently this is not supported by the default tokenizer behind nltk.word_tokenize, but other tokenizers provide it.
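Here is a code example with another tokenizer, as a minimal sketch using NLTK's WhitespaceTokenizer (the sample string is just an illustration):

from nltk.tokenize import WhitespaceTokenizer

s = "Good muffins cost $3.88\nin New York."
# span_tokenize() yields (start, end) character offsets, one pair per token
spans = list(WhitespaceTokenizer().span_tokenize(s))
print(spans)

Which gives:

[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]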
If you just want the starting offsets:
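Continuing from the sketch above (assuming spans still holds the list of (start, end) pairs):

# keep only the start index of each span
offsets = [span[0] for span in spans]
print(offsets)

which prints:

[0, 5, 13, 18, 24, 27, 31]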
For further information on the different tokenizers available, see the tokenize API docs.
You can also compute the offsets yourself by walking through the tokens from nltk.word_tokenize and locating each one in the original string with str.find, as in the sketch below:
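A minimal sketch (the helper name token_spans and the sample sentence are mine; note that str.find assumes each token appears verbatim in the text, which is not true for tokens that word_tokenize rewrites, such as quotation marks):

import nltk

def token_spans(txt):
    # nltk.word_tokenize requires the 'punkt' resource: nltk.download('punkt')
    tokens = nltk.word_tokenize(txt)
    offset = 0
    for token in tokens:
        # resume the search after the previous match so that
        # repeated tokens map to the correct positions
        offset = txt.find(token, offset)
        yield token, offset
        offset += len(token)

s = "And now for something completely different."
for token in token_spans(s):
    print(token)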
And get (for this sample sentence):
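('And', 0)
('now', 4)
('for', 8)
('something', 12)
('completely', 22)
('different', 33)
('.', 42)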