There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function that reverts the tokenized sentence to its original state? The function tokenize.untokenize() for some reason doesn't work.
Edit:
I know that I can, for example, do this, and it probably solves the problem, but I am curious whether there is an integrated function for it:
result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
Use token_utils.untokenize from here.

For me, it worked when I installed nltk 3.2.5; then:
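The snippet itself is missing here; a minimal sketch of what it likely looked like, assuming nltk 3.2.5, where the Moses detokenizer still shipped inside nltk as nltk.tokenize.moses:

import nltk
from nltk.tokenize.moses import MosesDetokenizer

detokenizer = MosesDetokenizer()
tokens = nltk.word_tokenize("I've found a medicine for my disease.")
# return_str=True makes detokenize() return a single string instead of a token list
detokenizer.detokenize(tokens, return_str=True)
# "I've found a medicine for my disease."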
If you are working inside a pandas DataFrame, then:
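The original snippet is not shown; a minimal sketch, assuming a hypothetical column named tokens that holds token lists, using the TreebankWordDetokenizer covered below:

import pandas as pd
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenizer = TreebankWordDetokenizer()
df = pd.DataFrame({'tokens': [['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']]})
# apply the detokenizer row by row to rebuild one sentence string per token list
df['sentence'] = df['tokens'].apply(detokenizer.detokenize)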
The reason there is no simple answer is that you actually need the span locations of the original tokens in the string. If you don't have those, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:
1) The original string
2) The original tokens
3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)
Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.
Here I'm using TweetTokenizer, but it shouldn't matter which tokenizer you use, as long as it doesn't change the values of your tokens so that they are no longer actually in the original string.
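A minimal sketch of that approach; the helper names are mine, and it assumes every token appears verbatim in the original string:

from nltk.tokenize import TweetTokenizer

def token_spans(text, tokens):
    # locate each token's (start, end) span, scanning left to right
    spans = []
    pos = 0
    for token in tokens:
        start = text.index(token, pos)  # raises ValueError if a token was altered
        spans.append((start, start + len(token)))
        pos = start + len(token)
    return spans

def replace_tokens(text, tokens, new_tokens):
    # splice in the modified tokens from back to front so earlier spans stay valid
    for (start, end), new_token in zip(reversed(token_spans(text, tokens)), reversed(new_tokens)):
        text = text[:start] + new_token + text[end:]
    return text

text = "I've found a medicine for my disease."
tokens = TweetTokenizer().tokenize(text)
new_tokens = ['cure' if t == 'medicine' else t for t in tokens]
print(replace_tokens(text, tokens, new_tokens))
# I've found a cure for my disease.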
You can use the "treebank detokenizer", TreebankWordDetokenizer:
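For example, applied to the tokens from the question:

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = word_tokenize("I've found a medicine for my disease.")
TreebankWordDetokenizer().detokenize(tokens)
# "I've found a medicine for my disease."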
There is also MosesDetokenizer, which was in nltk but got removed because of licensing issues; it is available as a standalone Sacremoses package.
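Usage with the standalone package (installed with pip install sacremoses):

from sacremoses import MosesDetokenizer

detokenizer = MosesDetokenizer(lang='en')
tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
detokenizer.detokenize(tokens)
# "I've found a medicine for my disease."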
I am using the following code, without any major library function, for detokenization purposes. I am using detokenization for some specific tokens.
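That code isn't shown here; a minimal sketch in the same spirit, joining on spaces and then fixing up a few specific punctuation and contraction tokens with regular expressions:

import re

def untokenize(tokens):
    # join with spaces, then drop the space before common punctuation marks
    text = ' '.join(tokens)
    text = re.sub(r"\s+([.,!?:;])", r"\1", text)
    # re-attach contraction fragments such as 've, 's and n't to the previous word
    text = re.sub(r"\s+('\w+|n't)", r"\1", text)
    return text

print(untokenize(['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']))
# I've found a medicine for my disease.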