There are already spell-checking models available that suggest correct spellings based on a corpus of known-correct words. Can the granularity be increased from characters to words so that we get phrase suggestions instead, such that if an incorrect phrase is entered, the nearest correct phrase from the corpus is suggested? The model would, of course, be trained on a list of valid phrases.
Are there any Python libraries that already provide this functionality, or how should I proceed with an existing large gold-standard phrase corpus to get statistically relevant suggestions?
Note: this is different from a spell checker, as the alphabet in a spell checker is finite, whereas in a phrase corrector each symbol of the alphabet is itself a word and hence the alphabet is theoretically infinite; however, we can limit the number of words to those in a phrase bank.
What you want to build is an N-gram model, which consists in computing the probability of each word following a sequence of n-1 preceding words.
You can use the NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
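For example, a minimal tokenization sketch (this assumes the punkt tokenizer data has already been downloaded, e.g. via nltk.download('punkt')):

import nltk

text = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(text):
    # Split each sentence into word tokens
    print(nltk.word_tokenize(sentence))
# ['The', 'cat', 'is', 'cute', '.']
# ['He', 'jumps', 'and', 'he', 'is', 'happy', '.']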
You can consider a 2-gram model (Markov model):
What is the probability for "kitten" to follow "cute"?
...or a 3-gram model:
What is the probability for "kitten" to follow "the cute"?
etc.
Obviously, training the model with (n+1)-grams is costlier than with n-grams.
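For instance, a minimal sketch of extracting 3-grams from a tokenized sentence with nltk.util.ngrams:

import nltk
from nltk.util import ngrams

tokens = nltk.word_tokenize("The cat is cute.")
# Sliding window of 3 consecutive tokens
print(list(ngrams(tokens, 3)))
# [('The', 'cat', 'is'), ('cat', 'is', 'cute'), ('is', 'cute', '.')]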
Instead of considering words alone, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
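A minimal sketch of getting (word, tag) pairs (assuming the averaged_perceptron_tagger data has been downloaded):

import nltk

tokens = nltk.word_tokenize("The cat is cute")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('cute', 'JJ')]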
You can also try considering lemmas instead of the words themselves.
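A minimal sketch using NLTK's WordNetLemmatizer (assuming the wordnet data has been downloaded); note that it expects a WordNet part-of-speech code such as 'v' for verbs:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("jumps", pos="v"))  # 'jump'
print(lemmatizer.lemmatize("cats"))            # 'cat' (default pos is noun)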
Here are some interesting lectures about N-gram modelling:
- Introduction to N-grams
- Estimating N-gram Probabilities
Here is a simple, short (and unoptimized) code example for a 2-gram model:
from collections import defaultdict
import math

import nltk

# Count bigram occurrences: ngram[token][next_token] = count
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."

for sentence in nltk.sent_tokenize(corpus):
    # Lowercase the tokens; build a list so we can slice it below
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# Convert counts to log10 probabilities: log P(next_token | token)
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(count) - total
                    for nxt, count in ngram[token].items()}
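As a follow-up sketch (the suggest_next helper below is illustrative, not part of the code above), you could query the trained log-probabilities to rank likely next words, which is the building block for scoring and suggesting phrases:

# Hypothetical helper: rank candidate next words by log10 probability
def suggest_next(word, model, n=3):
    candidates = model.get(word.lower(), {})
    return sorted(candidates, key=candidates.get, reverse=True)[:n]

print(suggest_next("is", ngram))  # e.g. ['cute', 'happy']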