I am building a spam filter using NLTK in Python. I currently check for the occurrences of words and use the NaiveBayesClassifier, which gives an accuracy of 0.98 and an F measure of 0.92 for spam and 0.98 for non-spam. However, when checking the documents on which my program errs, I notice that a lot of the spam classified as non-spam consists of very short messages.
So I want to use the length of a document as a feature for the NaiveBayesClassifier. The problem is that it only handles binary values. Is there any other way to do this than, for example, saying length<100 = true/false?
(P.S. I have built the spam detector analogously to the http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html example.)
NLTK's implementation of Naive Bayes doesn't handle continuous-valued features like document length directly, but you could combine NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method will give you a conditional probability distribution over classes given the words in the document, i.e., P(cl|doc). What you want is P(cl|doc,len), the probability of a class given the words in the document and its length. If we make a few more independence assumptions, we get:
P(cl|doc,len) = (P(doc,len|cl) * P(cl)) / P(doc,len)
= (P(doc|cl) * P(len|cl) * P(cl)) / (P(doc) * P(len))
= (P(doc|cl) * P(cl)) / P(doc) * P(len|cl) / P(len)
= P(cl|doc) * P(len|cl) / P(len)
You've already got the first term from prob_classify, so all that's left to do is to estimate P(len|cl) and P(len).
You can get as fancy as you want when it comes to modeling document lengths, but to get started you can just assume that the logs of the document lengths are normally distributed. If you know the mean and the standard deviation of the log document lengths in each class and overall, it's then easy to calculate P(len|cl) and P(len).
Here's one way of going about estimating P(len):
from nltk.corpus import movie_reviews
from math import log
import numpy
from scipy.stats import norm

# log-lengths of every document in the corpus
loglens = [log(len(movie_reviews.words(f))) for f in movie_reviews.fileids()]

# fit a normal distribution to the log-lengths
mu = numpy.mean(loglens)
sd = numpy.std(loglens)
p = norm(mu, sd)
The only tricky things to remember are that this is a distribution over log-lengths rather than lengths and that it's a continuous distribution. So, the probability of a document of length L will be:
p.cdf(log(L+1)) - p.cdf(log(L))
The conditional length distributions can be estimated in the same way, using the log-lengths of the documents in each class. That should give you what you need for P(cl|doc,len).
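For concreteness, here's a rough sketch of how those pieces could fit together. It assumes you have a trained classifier and a documents list of (wordlist, label) pairs as in the book chapter; the helper names (fit_length_dist, length_prob, prob_classify_with_length) are just mine for illustration:

from math import log
import numpy
from scipy.stats import norm

def fit_length_dist(lengths):
    # fit a normal distribution to the log-lengths of a set of documents
    loglens = [log(n) for n in lengths]
    return norm(numpy.mean(loglens), numpy.std(loglens))

def length_prob(dist, L):
    # P(len = L) under a continuous distribution over log-lengths
    return dist.cdf(log(L + 1)) - dist.cdf(log(L))

# documents: list of (wordlist, label) pairs, as in the book example
labels = set(label for words, label in documents)
p_len = fit_length_dist(len(words) for words, label in documents)
p_len_cl = dict((cl, fit_length_dist(len(words) for words, label in documents
                                     if label == cl)) for cl in labels)

def prob_classify_with_length(classifier, featureset, doclen):
    # P(cl|doc,len) is proportional to P(cl|doc) * P(len|cl) / P(len)
    dist = classifier.prob_classify(featureset)
    scores = dict((cl, dist.prob(cl) * length_prob(p_len_cl[cl], doclen)
                       / length_prob(p_len, doclen)) for cl in dist.samples())
    total = sum(scores.values())
    return dict((cl, s / total) for cl, s in scores.items())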
There are multinomial Naive Bayes algorithms that can handle range values, but they are not implemented in NLTK. For the NLTK NaiveBayesClassifier, you could try having a couple of different length thresholds as binary features (see the sketch below). I'd also suggest trying a Maxent classifier to see how it handles shorter texts.
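A rough sketch of what that feature extractor could look like (the threshold values are arbitrary guesses, and word_features is the word list from the book example):

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    # a few length thresholds as extra binary features
    for threshold in (50, 100, 200):
        features['length<%d' % threshold] = (len(document) < threshold)
    return features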