I have been given a training text and a test text. I want to train a language model on the training data and then use it to calculate the perplexity of the test data.
This is my code:
```python
import os
import io
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize

with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()

if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize and lowercase the training text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len=n, max_len=n)

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = Laplace(n)
model.fit(train_data, padded_sents)
print(model.perplexity(trainTest))
```
When I run this code with n=1 (unigram), I get a perplexity of 1068.332393940235. With n=2 (bigram) I get 1644.3441077259993, and with trigrams I get 2552.2085752565313.
What is the problem with my code?
The way you are creating the test data is wrong: the training data is lowercased, but the test data is not converted to lowercase, and the start/end padding tokens are missing from the test data (you pass n=1 to pad_both_ends, so no padding is added regardless of the model order). Try this:
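Here is a minimal sketch of the fix: apply the same lowercasing to the test side and pad with the model order n instead of a hardcoded 1. The tiny inline corpora and the whitespace tokenization are my stand-ins so the example is self-contained; in your code you would keep reading the two files and using word_tokenize/sent_tokenize as before.

```python
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.lm import Laplace

n = 2

# Hypothetical tiny corpora standing in for the two files in the question;
# sentences are pre-split and whitespace-tokenized to keep the sketch
# self-contained (use sent_tokenize/word_tokenize on your real files).
train_sents = ["The cat sat on the mat .", "The dog sat on the rug ."]
test_sents = ["The cat sat on the rug ."]

# Lowercase the training sentences, exactly as in your question.
tokenized_train = [[w.lower() for w in sent.split()] for sent in train_sents]
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_train)
model = Laplace(n)
model.fit(train_data, padded_vocab)

# Apply the SAME preprocessing to the test data: lowercase every token and
# pad each sentence with <s>/</s> using the model order n (not n=1).
test_ngrams = [
    ng
    for sent in test_sents
    for ng in everygrams(
        list(pad_both_ends((w.lower() for w in sent.split()), n=n)),
        min_len=n, max_len=n,
    )
]
print(model.perplexity(test_ngrams))
```

With matching preprocessing on both sides, the perplexities across n become comparable; with the mismatched casing and padding, every out-of-vocabulary test token inflates the score.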