I am attempting to implement a Naive Bayes classifier using BNT and MATLAB. So far I have been sticking with simple tabular_CPD
variables and "guesstimating" probabilities for the variables. My prototype net so far consists of the following:
DAG = false(5);
DAG(1, 2:5) = true;
bnet = mk_bnet(DAG, [2 3 4 3 3]);
bnet.CPD{1} = tabular_CPD(bnet, 1, [.5 .5]);
bnet.CPD{2} = tabular_CPD(bnet, 2, [.1 .345 .45 .355 .45 .3]);
bnet.CPD{3} = tabular_CPD(bnet, 3, [.2 .02 .59 .2 .2 .39 .01 .39]);
bnet.CPD{4} = tabular_CPD(bnet, 4, [.4 .33333 .5 .33333 .1 .33333]);
bnet.CPD{5} = tabular_CPD(bnet, 5, [.5 .33333 .4 .33333 .1 .33333]);
engine = jtree_inf_engine(bnet);
Here variable 1 is my desired output variable, set to initially assign a .5 probability to either output class.
Variables 2-5 define CPDs for features I measure:
- 2 is a cluster size, ranging from 1 to a dozen or more
- 3 is a ratio that will be a real value >= 1
- 4 and 5 are standard deviation (real) values (X and Y scatter)
In order to classify a candidate cluster I break all of the feature measurements into 3-4 range brackets, like so:
...
evidence = cell(1, 5);
evidence{2} = sum(M > [0 2 6]);
evidence{3} = sum(O > [0 1.57 2 3]);
evidence{4} = sum(S(1) > [-Inf 1 2]);
evidence{5} = sum(S(2) > [-Inf 0.4 0.8]);
eng = enter_evidence(engine, evidence);
marginals = marginal_nodes(eng, 1);
e = marginals.T(1);
...
This actually works pretty well, considering I'm only guessing at range brackets and probability values. But I believe that what I should be using here is a gaussian_CPD. I think that a gaussian_CPD can learn both the optimal brackets and probabilities (as means, covariance matrices, and weights).
My problem is, I am not finding any simple examples of how the BNT gaussian_CPD class is used. How, for example, would I go about initializing a gaussian_CPD to approximately the same behavior as one of my tabular_CPD variables above?
I eventually figured this out by experimenting with BNT at the MATLAB command prompt, defining my classifier net with gaussian_CPD nodes. To train it, I used my original classifier to help me label a set of 300 samples, then simply ran two thirds of them through the training algorithm.
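The net definition and the training call were along these lines. This is a reconstruction, not the original listing: the node sizes follow the question's net, the initial means and covariances are placeholders (learn_params overwrites them), and labels, M, O, and S here stand for arrays holding the labeled samples.

```matlab
DAG = false(5);
DAG(1, 2:5) = true;
% Continuous feature nodes have size 1; only node 1 (the class) is discrete.
bnet = mk_bnet(DAG, [2 1 1 1 1], 'discrete', 1);
bnet.CPD{1} = tabular_CPD(bnet, 1, 'CPT', [.5 .5]);
for i = 2:5
    % One mean/variance per class state; these initial values are placeholders.
    bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0 1], 'cov', [1 1]);
end

% Training data: an nnodes-by-ncases cell array, fully observed.
% Row 1 holds class labels; rows 2-5 hold raw (unbracketed) feature values.
ncases = 200;                              % two thirds of the 300 samples
data = cell(5, ncases);
data(1,:) = num2cell(labels(1:ncases).');  % 'labels' is a hypothetical name
data(2,:) = num2cell(M(1:ncases).');
data(3,:) = num2cell(O(1:ncases).');
data(4,:) = num2cell(S(1:ncases, 1).');
data(5,:) = num2cell(S(1:ncases, 2).');
bnet = learn_params(bnet, data);

% Peek at the learned class prior:
s = struct(bnet.CPD{1});
dispcpt(s.CPT);
```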
The output from dispcpt gives a rough idea of the breakdown between class assignments in the labeled samples in the training set.

To test the new classifier I ran the last third of the samples through both the original and the new Bayes nets. Here is the code I used for the new net:
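Reconstructed, the classification pass looks roughly like this. Note that the raw measurements go in directly as evidence, with no range bracketing; the test-set variables (Mtest, Otest, Stest) are hypothetical names.

```matlab
engine = jtree_inf_engine(bnet);
nTest = size(Stest, 1);
scores = zeros(nTest, 1);
for i = 1:nTest
    evidence = cell(1, 5);
    evidence{2} = Mtest(i);        % raw continuous values, no bracketing
    evidence{3} = Otest(i);
    evidence{4} = Stest(i, 1);
    evidence{5} = Stest(i, 2);
    eng = enter_evidence(engine, evidence);
    marg = marginal_nodes(eng, 1);
    scores(i) = marg.T(1);         % P(class 1 | features), for the ROC curve
end
```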
Then, to figure out whether there was any improvement, I plotted overlaid ROC diagrams. As it turned out, my original net did well enough that it was hard to tell for certain whether the trained net using Gaussian CPDs did any better. Printing the areas under the ROC curves clarified that the new net did indeed perform slightly better (base area is the original net, and area is the new one).

I'm posting this here so that next time I need to do this, I'll be able to find an answer... Hopefully someone else might find it useful as well.
I present below a complete example that illustrates how to build a naive Bayes net using the BNT toolbox. I am using a subset of the cars dataset, which contains both discrete and continuous attributes.

Just for convenience, I am using a couple of functions that require the Statistics Toolbox.
We start by preparing the dataset:
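A sketch of that preparation, assuming the carsmall sample data that ships with the Statistics Toolbox; the particular attributes picked here are illustrative, not necessarily the original choice.

```matlab
load carsmall                              % sample data, Statistics Toolbox
keep = ~isnan(MPG) & ~isnan(Weight);       % drop incomplete cases
Y    = grp2idx(cellstr(Origin(keep,:)));   % class labels as integers 1..K
cyl  = grp2idx(Cylinders(keep));           % discrete attribute, mapped to 1..3
mpg  = MPG(keep);                          % continuous attributes
wgt  = Weight(keep);
```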
Next we build our graphical model:
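The model is a naive Bayes structure: the class node is the sole parent of every attribute node. A sketch, with node sizes taken from the prepared variables (Y, cyl, mpg, wgt, as named in the preparation step):

```matlab
N = 4;                               % 1 = class, 2 = cylinders, 3 = MPG, 4 = weight
dag = false(N);
dag(1, 2:N) = true;                  % class points at each attribute
ns = [max(Y) max(cyl) 1 1];          % node sizes; continuous nodes have size 1
bnet = mk_bnet(dag, ns, 'discrete', [1 2]);
bnet.CPD{1} = tabular_CPD(bnet, 1);  % random initial parameters throughout
bnet.CPD{2} = tabular_CPD(bnet, 2);
bnet.CPD{3} = gaussian_CPD(bnet, 3);
bnet.CPD{4} = gaussian_CPD(bnet, 4);
```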
Now we split the data into training/testing:
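For example, a plain two-thirds/one-third random split:

```matlab
n = numel(Y);
perm = randperm(n);                  % random permutation of case indices
nTrain   = floor(2/3 * n);
trainIdx = perm(1:nTrain);
testIdx  = perm(nTrain+1:end);
```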
Finally we learn the parameters from the training set, and predict the class of the test data:
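A sketch of both steps. BNT's learn_params expects fully observed cases as an nnodes-by-ncases cell array; prediction enters each test case's attributes as evidence and reads off the class marginal. Variable names (Y, cyl, mpg, wgt, trainIdx, testIdx) follow the preparation and split steps and are assumptions here.

```matlab
% Assemble fully observed training cases: one column per case.
data = cell(4, numel(trainIdx));
data(1,:) = num2cell(Y(trainIdx).');
data(2,:) = num2cell(cyl(trainIdx).');
data(3,:) = num2cell(mpg(trainIdx).');
data(4,:) = num2cell(wgt(trainIdx).');
bnet = learn_params(bnet, data);

% Predict each test case from the class marginal.
engine = jtree_inf_engine(bnet);
pred = zeros(numel(testIdx), 1);
for i = 1:numel(testIdx)
    ev = cell(1, 4);                  % node 1 (the class) stays hidden
    ev{2} = cyl(testIdx(i));
    ev{3} = mpg(testIdx(i));
    ev{4} = wgt(testIdx(i));
    eng = enter_evidence(engine, ev);
    marg = marginal_nodes(eng, 1);
    [~, pred(i)] = max(marg.T);       % most probable class
end
```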
The results:
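One way to summarize results like these is a confusion matrix (confusionmat is from the Statistics Toolbox; pred and the true test labels are assumed to come from the prediction step):

```matlab
cm  = confusionmat(Y(testIdx), pred)   % rows = true class, cols = predicted
acc = sum(diag(cm)) / sum(cm(:))       % overall accuracy
```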
and we can extract the CPT and mean/sigma at each node:
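BNT's CPD objects hide their fields, but struct() exposes the fitted parameters. For instance:

```matlab
% Class prior (CPT of the discrete root node):
s1 = struct(bnet.CPD{1});
dispcpt(s1.CPT)

% Per-class mean and variance of a continuous node:
s3 = struct(bnet.CPD{3});
mu    = s3.mean    % 1-by-K, one mean per class
sigma = s3.cov     % 1-by-1-by-K, one variance per class
```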