There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
The result I get is:
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function that reverts the tokenized sentence to its original state? The function tokenize.untokenize()
for some reason doesn't work.
Edit:
I know that I can do, for example, the following, and this probably solves the problem, but I am curious whether there is a built-in function for this:
result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
I propose keeping offsets during tokenization: (token, offset). I think this information is useful for working back from the tokens to the original sentence.
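A minimal sketch of this idea, assuming every token appears verbatim in the original sentence (which is not true for tokens that word_tokenize rewrites, e.g. quotation marks turned into `` and ''):

import nltk

sentence = "I've found a medicine for my disease."
tokens = nltk.word_tokenize(sentence)

# Record where each token starts in the original string.
offsets, cursor = [], 0
for tok in tokens:
    start = sentence.find(tok, cursor)
    offsets.append((tok, start))
    cursor = start + len(tok)

print(offsets)
# [('I', 0), ("'ve", 1), ('found', 5), ('a', 11), ('medicine', 13), ...]

# "Untokenizing" is then just slicing the original text between the
# first and the last token.
last_tok, last_start = offsets[-1]
print(sentence[offsets[0][1]:last_start + len(last_tok)])
# I've found a medicine for my disease.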
The reason tokenize.untokenize does not work is that it needs more information than just the words. (Additional help: Tokenize - Python Docs | Potential Problem.) Here is an example program using tokenize.untokenize:
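A minimal sketch of such a program, assuming the input is a single line of Python source rather than a natural-language sentence:

import io
import tokenize

# tokenize works on Python source code: each token carries a type, a
# string, and start/end positions, not just the word itself.
source = "print('Hello World')"
tokens = tokenize.generate_tokens(io.StringIO(source).readline)

# untokenize needs at least (token_type, token_string) pairs, which is the
# "more information" that a plain list of words does not provide.
print(tokenize.untokenize(tokens))  # -> print('Hello World')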
To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.
Short of doing crazy hacks on nltk, you can try this:
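One sketch of such a hack (the untokenize helper below is hypothetical, not part of NLTK): join the tokens with spaces, then glue punctuation and contraction fragments back onto the preceding word.

import string

def untokenize(tokens):
    # Hypothetical helper: join with spaces, but attach punctuation and
    # contraction fragments (e.g. "'ve", "n't") to the previous word
    # instead of putting a space before them.
    text = ""
    for tok in tokens:
        if tok in string.punctuation or tok.startswith("'") or tok == "n't":
            text += tok
        else:
            text += " " + tok
    return text.strip()

words = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(untokenize(words))  # I've found a medicine for my disease.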
Use the join function:
You could just do ' '.join(words) to get back the original string.
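For example, with the tokens from the question:

words = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(' '.join(words))  # I 've found a medicine for my disease .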