I have a problem with text matching when I tokenize: the tokenizer splits apart specific words, dates and numbers. How can I prevent phrases like "runs in my family", "30 minute walk" or "4x a day" from being split when tokenizing words with NLTK?
They should not result in:
['runs', 'in', 'my', 'family', '4x', 'a', 'day']
For example:
Yes 20-30 minutes a day on my bike, it works great!!
gives:
['yes', '20-30', 'minutes', 'a', 'day', 'on', 'my', 'bike', ',', 'it', 'works', 'great']
I want '20-30 minutes' to be treated as a single word. How can I get this behavior?
You can use the `MWETokenizer`:
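The original snippet was omitted here, so this is a minimal sketch of the idea, assuming the phrase to keep is "20-30 minutes": run `word_tokenize` first, then let `MWETokenizer` re-merge the tokens of each registered multi-word expression.

```python
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

# Each MWE is given as a tuple of the tokens it consists of;
# matched token sequences are re-joined with `separator`.
tokenizer = MWETokenizer([('20-30', 'minutes')], separator=' ')

text = "Yes 20-30 minutes a day on my bike, it works great!!"
print(tokenizer.tokenize(word_tokenize(text.lower())))
# ['yes', '20-30 minutes', 'a', 'day', 'on', 'my', 'bike', ',',
#  'it', 'works', 'great', '!', '!']
```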
A more principled approach, since you don't know how `word_tokenize` will split the words you want to keep:
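The code for this was also stripped; here is a sketch of that approach: tokenize each phrase with `word_tokenize` itself and register the resulting token sequence as the MWE, so the entries always match whatever splits `word_tokenize` actually produces. The `phrases` list is illustrative.

```python
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

phrases = ['runs in my family', '20-30 minutes', '4x a day']

tokenizer = MWETokenizer(separator=' ')
for phrase in phrases:
    # Let word_tokenize decide how the phrase splits, then register
    # exactly that token sequence as a multi-word expression.
    tokenizer.add_mwe(word_tokenize(phrase))

text = "Yes 20-30 minutes a day on my bike, it works great!!"
print(tokenizer.tokenize(word_tokenize(text.lower())))
# ['yes', '20-30 minutes', 'a', 'day', 'on', 'my', 'bike', ',',
#  'it', 'works', 'great', '!', '!']
```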
To my knowledge, you will be hard pressed to preserve n-grams of various lengths at the same time as tokenizing, but you can find these n-grams as shown here and then replace the items in the corpus that you want kept as n-grams with some joining character, like dashes.
This is an example solution, but there are probably lots of ways to get there. An important note: I provided a way to find n-grams that are common in the text. You will probably want more than one, so there is a variable that lets you decide how many of the n-grams to collect; you might want a different number for each length, but for now there is only one variable. This may still miss n-grams you find important. To handle those, you can add the ones you want to find to `user_grams`, and they will be added to the search.

This section finds common n-grams up to 5-grams.
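The code was omitted from this answer, so what follows is a reconstruction under stated assumptions: `num_to_keep` and `common_grams` are illustrative names, and the corpus is just the example sentence from the question.

```python
from collections import Counter
from nltk import word_tokenize
from nltk.util import ngrams

corpus = "Yes 20-30 minutes a day on my bike, it works great!!"
tokens = word_tokenize(corpus.lower())

num_to_keep = 1  # how many of the most frequent n-grams to collect per length

common_grams = []
for n in range(2, 6):  # bigrams up to 5-grams
    counts = Counter(ngrams(tokens, n))
    common_grams.extend(gram for gram, _ in counts.most_common(num_to_keep))
```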
This section lets you add your own n-grams to the list:
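Again a sketch: `user_grams` holds the raw phrases, which are tokenized the same way as the corpus so the token sequences line up in the replacement step.

```python
user_grams = ['runs in my family', '4x a day']

# Tokenize each user-supplied phrase the same way the corpus was
# tokenized, so the token sequences match during replacement.
common_grams.extend(tuple(word_tokenize(phrase.lower())) for phrase in user_grams)
```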
And this last part performs the processing so that you can tokenize again and get the n-grams as tokens:
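A sketch of that final step: rewrite each collected n-gram in the raw text as a dash-joined version, so a second `word_tokenize` pass keeps it as a single token.

```python
processed = corpus.lower()
for gram in common_grams:
    # Replace the space-separated n-gram with a dash-joined version
    # so the next tokenization pass treats it as one token.
    processed = processed.replace(' '.join(gram), '-'.join(gram))

print(word_tokenize(processed))
```

Note that this simple `str.replace` only matches n-grams that appear space-separated in the raw text; n-grams containing punctuation tokens such as ',' would need extra handling.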
I think this is actually a very good question.