I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving this error:
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already run Latent Dirichlet Allocation on my data and generated the unigrams and their respective probabilities (they are normalized, so the probabilities sum to 1).
My unigrams and their probabilities look like this:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have. The same format continues for thousands of lines. The probabilities in the second column sum to 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify this. The sample code I have here is from the nltk documentation and I don't know what to do now. Please help on what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
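Written out for a test set W = w_1 w_2 ... w_N, and assuming the unigram model treats words as independent, the standard definition reads as follows (the symbols PP, W and N are notation chosen here, not taken from the post):

```latex
PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
      = \left( \prod_{i=1}^{N} \frac{1}{P(w_i)} \right)^{\frac{1}{N}}
```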
Now you say you have already constructed the unigram model, meaning that for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that provides the probability of each word in the corpus. You also need a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I can adapt my solution accordingly.
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
Here's how we construct the unigram model first:
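A minimal sketch of this step, assuming a short made-up corpus and plain whitespace tokenization. The names build_unigram_model and unknown_prob are illustrative, and 0.01 is used as the fallback probability for unseen words:

```python
import collections

# Toy corpus, made up for illustration only.
corpus = "Monty Python were a British comedy group . Monty Python made films ."

def build_unigram_model(tokens, unknown_prob=0.01):
    """Relative-frequency unigram model; unseen words get unknown_prob."""
    counts = collections.Counter(tokens)
    total = float(len(tokens))
    # A defaultdict returns unknown_prob for any word not seen in training.
    model = collections.defaultdict(lambda: unknown_prob)
    for word, count in counts.items():
        model[word] = count / total
    return model

tokens = corpus.split()
model = build_unigram_model(tokens)
print(model["Monty"])        # seen twice among 13 tokens
print(model["abracadabra"])  # unseen, falls back to 0.01
```

Note that a fixed fallback like this is a crude floor rather than proper smoothing: once unseen words are counted, the probabilities no longer sum to 1. Looking up an unseen word in a defaultdict also stores it in the model as a side effect.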
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
Now we can test this on two different test sets:
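A sketch of the perplexity computation and the two test sets, assuming a hypothetical toy model in which Monty has probability 0.2 and every unseen word falls back to 0.01. The probabilities are made up for illustration, and the function sums log-probabilities rather than multiplying raw probabilities, a design choice made here to avoid underflow on long test sets:

```python
import collections
import math

def perplexity(testset, model):
    """Unigram perplexity of a whitespace-separated test string."""
    words = testset.split()
    if not words:
        raise ValueError("test set is empty")
    # exp(-(1/N) * sum(log p)) equals (prod 1/p) ** (1/N).
    log_prob = sum(math.log(model[word]) for word in words)
    return math.exp(-log_prob / len(words))

# Hypothetical toy model: illustrative probabilities, 0.01 for unseen words.
model = collections.defaultdict(lambda: 0.01, {"Monty": 0.2, "Python": 0.2})

testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"

print(perplexity(testset1, model))
print(perplexity(testset2, model))
```

With these assumed numbers the first value is about 5 (that is, 1/0.2) and the second about 100 (that is, 1/0.01), up to floating-point rounding.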
for which you get the following result:
Note that with perplexity, lower is better: a language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty was included in the unigram model, so its perplexity was also lower.
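The same computation can be applied to a unigram file in the "word probability" format from the question. The following is a sketch under assumptions: parse_unigrams and unknown_prob are hypothetical names, the file name in the comment is a placeholder, and the floor of 1e-10 for missing words is an arbitrary choice that you would tune for your data.

```python
import math

# A few lines in the "word probability" format shown in the question.
sample_lines = [
    "Negroponte 1.22948976891e-05",
    "yellow 1.26582575863e-05",
    "four 0.000207196011781",
]

def parse_unigrams(lines):
    """Turn 'word probability' lines into a {word: probability} dict."""
    unigram = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        unigram[parts[0]] = float(parts[1])
    return unigram

def perplexity(words, unigram, unknown_prob=1e-10):
    """Unigram perplexity of a list of words.

    unknown_prob is an assumed floor for words missing from the model.
    """
    if not words:
        raise ValueError("test set is empty")
    log_prob = sum(math.log(unigram.get(w, unknown_prob)) for w in words)
    return math.exp(-log_prob / len(words))

# For a real file (placeholder name), something like:
#   with open("unigrams.txt") as f:
#       unigram = parse_unigrams(f)
unigram = parse_unigrams(sample_lines)
print(perplexity(["four", "yellow", "four"], unigram))
```

Using dict.get with a default, instead of a defaultdict, keeps the loaded model unchanged when unseen words are looked up.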