I use the Naive Bayes classifier in Python NLTK to compute the probability distribution for the following example:
import nltk

def main():
    # Two training items per class; the single feature takes the value 0 or 1.
    train = [(dict(feature=1), 'class_x'), (dict(feature=0), 'class_x'),
             (dict(feature=0), 'class_y'), (dict(feature=0), 'class_y')]
    test = [dict(feature=1)]

    classifier = nltk.classify.NaiveBayesClassifier.train(train)

    print("classes available: ", sorted(classifier.labels()))
    print("input assigned to: ", classifier.classify_many(test))

    for pdist in classifier.prob_classify_many(test):
        print("probability distribution: ")
        print('%.4f %.4f' % (pdist.prob('class_x'), pdist.prob('class_y')))

if __name__ == '__main__':
    main()
There are two classes (class_x and class_y) in the training dataset, and each class gets two training inputs. For class_x, the first input has feature value 1 and the second has value 0. For class_y, both inputs have feature value 0. The test dataset consists of a single input with feature value 1.
When I run the code, the output is:
classes available: ['class_x', 'class_y']
input assigned to: ['class_x']
probability distribution: 
0.7500 0.2500
To get the probability (or likelihood) of each class, the classifier should multiply the class prior (0.5 here, since each class accounts for half of the training items) by the probability of each test feature given that class, with some form of smoothing applied.
I usually use a formula along these lines (or some variant of it), with add-one (Laplace) smoothing:
P(feature|class) = (frequency of feature in class + 1) / (total features in class + vocabulary size)
and then multiply the class prior by the product of these smoothed feature probabilities. The exact smoothing varies and slightly changes the outcome.
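For concreteness, here is a minimal sketch of that add-one computation applied to the example above (the smoothed_prob helper is purely illustrative, not part of NLTK). Note that it does not reproduce the numbers NLTK prints:

# Hand computation with the add-one (Laplace) formula above; the
# smoothed_prob helper is illustrative, not part of NLTK.
def smoothed_prob(count, total, vocab_size, gamma=1.0):
    # (frequency of feature in class + gamma) / (total features in class + gamma * vocabulary size)
    return (count + gamma) / (total + gamma * vocab_size)

prior = 0.5                    # each class holds 2 of the 4 training items
p1_x = smoothed_prob(1, 2, 2)  # feature=1 seen once in class_x: (1 + 1) / (2 + 2) = 0.5
p1_y = smoothed_prob(0, 2, 2)  # feature=1 never seen in class_y: (0 + 1) / (2 + 2) = 0.25

score_x = prior * p1_x         # 0.25
score_y = prior * p1_y         # 0.125
print('%.4f %.4f' % (score_x / (score_x + score_y),
                     score_y / (score_x + score_y)))  # 0.6667 0.3333, not 0.7500 0.2500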
In the example code above, how exactly does the classifier compute the probability distribution? What is the formula used?
I checked here and here, but could not find exactly how the computation is done.
Thanks in advance.
From the source code:
https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L9
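As far as I can tell from that file, two details answer the question. First, NaiveBayesClassifier.train uses estimator=ELEProbDist by default; ELE (expected likelihood estimation) is Lidstone smoothing with gamma = 0.5, so it adds 0.5 (not 1) to every count and gamma * bins to the denominator, where bins is the number of distinct values the feature takes. Second, prob_classify sums the logs of the smoothed prior and of each smoothed P(feature=value|label), then normalizes across labels:
P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / SUM over all labels l of ( P(l) * P(f1|l) * ... * P(fn|l) )

Here is a short sketch re-deriving the printed numbers by hand under that reading of the source (the lidstone helper is mine, not NLTK's API):

# Re-deriving NLTK's output by hand, assuming the default ELE estimator
# (Lidstone smoothing with gamma = 0.5). The lidstone helper below is
# illustrative, not part of NLTK's API.
gamma = 0.5
bins = 2  # 'feature' takes two distinct values in the training data: 0 and 1

def lidstone(count, total):
    # (count + gamma) / (total + gamma * bins)
    return (count + gamma) / (total + gamma * bins)

# Smoothed priors: each label covers 2 of the 4 training items, and there
# are 2 labels, so (2 + 0.5) / (4 + 0.5 * 2) = 0.5 for both classes.
p_x = p_y = 0.5

# Smoothed P(feature=1 | class): the value 1 occurs once in class_x's two
# items and never in class_y's two items.
p1_given_x = lidstone(1, 2)  # (1 + 0.5) / (2 + 1) = 0.5
p1_given_y = lidstone(0, 2)  # (0 + 0.5) / (2 + 1) ≈ 0.1667

# Unnormalized scores, then normalization across the two labels.
score_x = p_x * p1_given_x   # 0.25
score_y = p_y * p1_given_y   # ≈ 0.0833
print('%.4f %.4f' % (score_x / (score_x + score_y),
                     score_y / (score_x + score_y)))  # 0.7500 0.2500

With gamma = 1 (plain Laplace) the same arithmetic gives 0.6667 0.3333 instead, which is why the add-one formula in the question does not reproduce NLTK's output.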