In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder.
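It shows how to find the n best collocations by PMI for bigrams and trigrams. For example:
>>> import nltk
>>> from nltk.collocations import BigramCollocationFinder
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)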
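For reference, here is a rough, standard-library-only sketch of what ranking bigrams by PMI amounts to. It is not NLTK's implementation: the function name `bigram_pmi_nbest` is hypothetical, and it assumes contiguous bigrams and the usual formula log2(c(w1, w2) * N / (c(w1) * c(w2))), where N is the number of tokens.

```python
import math
from collections import Counter

def bigram_pmi_nbest(words, n):
    """Return the n contiguous bigrams with the highest PMI (hypothetical helper)."""
    word_counts = Counter(words)                    # c(w)
    bigram_counts = Counter(zip(words, words[1:]))  # c(w1, w2)
    total = len(words)                              # N

    def pmi(pair):
        w1, w2 = pair
        joint = bigram_counts[pair] * total
        return math.log2(joint / (word_counts[w1] * word_counts[w2]))

    # Highest score first; ties are broken by the bigram itself.
    return sorted(bigram_counts, key=lambda p: (-pmi(p), p))[:n]

words = "a b a b c d".split()
print(bigram_pmi_nbest(words, 1))
```

Note that PMI only needs the n-gram count, the unigram counts, and the total; other association measures need more of the contingency table.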
I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.
How can I use the methods (e.g. nbest()) in AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, as easily as for bigrams and trigrams?
Should I create a class that inherits from AbstractCollocationFinder?
Thanks.
Edited
The current NLTK has a hardcoded finder for up to 4-grams, QuadgramCollocationFinder, but the reasoning for why you cannot simply create an NgramCollocationFinder still stands: you would have to radically change the formulas in the from_words() function for each order of n-gram.
Short answer: no, you cannot simply create an AbstractCollocationFinder (ACF) and call its nbest() function if you want to find collocations beyond 2- and 3-grams.
That is because from_words() differs for each n-gram order. Only the subclasses of ACF (i.e. BigramCF and TrigramCF) have a from_words() function.
So, given the from_words() in TrigramCF, you could somehow hack it and try to hardcode a 4-gram association finder along the same lines.
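To make the point concrete, here is a simplified, standard-library-only sketch of the kind of counting a trigram from_words() has to do. It is not NLTK's source: the function name `trigram_counts` is hypothetical, and it assumes a window of exactly 3 tokens. The four tables (words, bigrams, "wildcard" pairs that skip the middle word, and trigrams) are all written out by hand for this one order.

```python
from collections import Counter

def trigram_counts(words):
    """Tally the four frequency tables a trigram finder needs (hypothetical helper)."""
    wfd = Counter()     # single words
    bfd = Counter()     # adjacent pairs (w1, w2)
    wildfd = Counter()  # pairs (w1, w3) that skip the middle word
    tfd = Counter()     # triples (w1, w2, w3)
    padded = list(words) + [None, None]  # right padding so trailing words are counted
    for i in range(len(words)):
        w1, w2, w3 = padded[i], padded[i + 1], padded[i + 2]
        wfd[w1] += 1
        if w2 is None:
            continue
        bfd[(w1, w2)] += 1
        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1
    return wfd, bfd, wildfd, tfd

wfd, bfd, wildfd, tfd = trigram_counts("a b a b c".split())
print(tfd)
```

A 4-gram version would add a fourth position to the same loop, plus a table for every additional wildcard pattern, which is why each order ends up hardcoded.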
You would then also have to change whichever part of the code uses the cls returned from from_words() accordingly.
So you have to ask: what is the ultimate purpose of finding the collocations?
If you're looking at retrieving words within collocations of windows larger than 2- or 3-grams, then you pretty much end up with a lot of noise in your word retrieval.
If you're going to build a model based on a collocation model using 2- or 3-gram windows, then you will also face sparsity problems.
If you want to find n-grams beyond 2- or 3-grams, you can use the scikit package and the FreqDist function to get the counts for these n-grams. I tried doing this with nltk.collocations, but I don't think it can score anything beyond 3-grams, so I decided to go with n-gram counts instead. I hope this helps you a little bit. Thanks.
here is the code
This will give output as
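A minimal sketch of this count-based approach, assuming plain `collections.Counter` as a stand-in for FreqDist (NLTK's FreqDist is a Counter subclass) and a hypothetical helper named `ngram_counts`. It only counts contiguous n-grams; it does not compute any association score.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams in a list of tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields each n-gram as a tuple
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

tokens = "the quick brown fox jumps over the quick brown fox".split()
four_grams = ngram_counts(tokens, 4)
print(four_grams.most_common(1))
# [(('the', 'quick', 'brown', 'fox'), 2)]
```

The same function works for any n, since raw counts do not depend on the contingency tables that the collocation finders need.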