I recently wrote an implementation of Naive Bayes to classify examples into one of 5 different groups. The number of features n was very large, and each feature could be on (1) or off (0). Using the training set, I estimated a 5 × n matrix P of conditional probabilities for each group Gi against each feature Fj, for 1≤i≤5, 1≤j≤n, so that cell (i,j) = P(Fj=1|Gi). (I'm ignoring the probabilities P(Fj=0|Gi) as they aren't relevant to this discussion.)
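In NumPy terms, the estimation could look something like the following, where X and y are small random stand-ins (not my real data) for the binary training matrix and the group labels:

```python
import numpy as np

# Hypothetical stand-ins for the training data: X is m x n with 0/1 features,
# y holds the group index (0..4) of each training example.
rng = np.random.default_rng(0)
m, n = 1000, 5000
X = rng.integers(0, 2, size=(m, n))
y = rng.integers(0, 5, size=m)

# P[i, j] estimates P(Fj=1 | Gi): the fraction of group-i examples with feature j on.
P = np.vstack([X[y == g].mean(axis=0) for g in range(5)])
```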
What I wanted to do was, given a new example E, a 1 × n vector, multiply together the conditional probabilities from the matrix P corresponding to features which were on in the new example. I had two concerns in doing this:
- the very large number of features meant that an explicit loop would be very slow
- repeated multiplication of many small probabilities might lead to underflow and loss of accuracy
What I did was to take the log of P, L = log(P), and then compute the matrix product E L'. The multiplication gives a 1 × 5 result, and the maximum of the result indicates which group, assuming the prior probabilities are roughly equal. This addresses the speed issue through vectorization and the accuracy issue by working in logs, which also turns the product into a sum. Another advantage is that E L' works just as well for a whole set of examples, with E a matrix rather than a single row vector.
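In NumPy terms, the scoring step could look something like this (P and E below are small random stand-ins, not my actual data):

```python
import numpy as np

# Small random stand-ins: P is 5 x n with P(Fj=1|Gi); E holds new 0/1 examples as rows.
rng = np.random.default_rng(1)
n = 8
P = rng.uniform(0.05, 0.95, size=(5, n))
E = rng.integers(0, 2, size=(3, n))   # three new examples; a single 1 x n row works the same way

L = np.log(P)               # elementwise log of the conditional probabilities
scores = E @ L.T            # each row: sum of log P(Fj=1|Gi) over the features that are on
predicted = scores.argmax(axis=1)   # group with the largest score, assuming roughly equal priors
print(predicted)
```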
My question is, is taking logs like this a reasonable / standard approach? It seems like it might be the obvious, "101" approach, but I have limited experience of implementing algorithms like this so I would appreciate feedback from those with more experience.
For reference, in the Naive Bayes approach, Bayes' theorem gives the probability of being in group g conditional on features F=f as
P(G=g|F=f) = P(F=f|G=g) P(G=g) / P(F=f)
Expanding the feature vector F into its components F1, ..., Fn gives
P(G=g|F1=f1, F2=f2, ..., Fn=fn) = P(F1=f1, F2=f2, ..., Fn=fn|G=g) P(G=g) / P(F=f)
Applying the naive assumption that the features are independent given the group,
P(G=g|F1=f1, F2=f2, ..., Fn=fn) = P(F1=f1|G=g) P(F2=f2|G=g) ... P(Fn=fn|G=g) P(G=g) / P(F=f)
The denominator can be dropped as it's the same for all g, so we have
P(G=g|F1=f1, F2=f2, ..., Fn=fn) ∝ P(F1=f1|G=g) P(F2=f2|G=g) ... P(Fn=fn|G=g) P(G=g)
Here, P(G=g) is the prior probability.
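Taking logs of both sides (which preserves the maximum) turns that product into the sum that the multiplication E L' computes, up to an additive constant from the dropped denominator; the log P(G=g) term can also be dropped when the priors are roughly equal:
log P(G=g|F1=f1, F2=f2, ..., Fn=fn) = log P(F1=f1|G=g) + log P(F2=f2|G=g) + ... + log P(Fn=fn|G=g) + log P(G=g) + constant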
I assume you have rewritten the multiplication of E P' to deal with the fact that you're representing P by log(P)?
Representing conditional probabilities by their logs is a very common technique to work around the fact that they can get so freaking small.
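A quick sketch of what goes wrong without logs, using made-up numbers:

```python
import numpy as np

# Made-up example: multiplying many small probabilities underflows double precision,
# while summing their logs stays comfortably in range.
p = np.full(2000, 0.01)        # 2000 hypothetical conditional probabilities of 0.01
print(np.prod(p))              # 0.0 -- the true value, 1e-4000, underflows
print(np.log(p).sum())         # about -9210.3, easily representable
```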
Many implementations of robust classifiers in automatic target recognition applications (Dempster-Shafer, for instance) force probabilities to always be non-zero. What you're proposing is another way of getting at that.
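On the estimation side, a common way to keep the probabilities away from zero (and so keep log(P) finite) is Laplace, i.e. add-one, smoothing of the per-group counts; a sketch with made-up X and y in place of a real training set:

```python
import numpy as np

# Hypothetical stand-ins: X is m x n with 0/1 features, y holds group labels 0..4.
rng = np.random.default_rng(2)
m, n = 200, 50
X = rng.integers(0, 2, size=(m, n))
y = rng.integers(0, 5, size=m)

counts = np.vstack([X[y == g].sum(axis=0) for g in range(5)])    # times feature j was on in group g
totals = np.array([(y == g).sum() for g in range(5)])[:, None]   # number of examples in group g
P = (counts + 1) / (totals + 2)   # add-one smoothing for a binary feature: never exactly 0 or 1
```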