Simple example/use-case for a BNT gaussian_CPD?

2019-04-02 12:19发布

问题:

I am attempting to implement a Naive Bayes classifier using BNT and MATLAB. So far I have been sticking with simple tabular_CPD variables and "guesstimating" probabilities for the variables. My prototype net so far consists of the following:

DAG = false(5);
DAG(1, 2:5) = true;
bnet = mk_bnet(DAG, [2 3 4 3 3]);
bnet.CPD{1} = tabular_CPD(bnet, 1, [.5  .5]);
bnet.CPD{2} = tabular_CPD(bnet, 2, [.1  .345   .45 .355   .45 .3]);
bnet.CPD{3} = tabular_CPD(bnet, 3, [.2  .02    .59 .2     .2  .39    .01 .39]);
bnet.CPD{4} = tabular_CPD(bnet, 4, [.4  .33333 .5  .33333 .1  .33333]);
bnet.CPD{5} = tabular_CPD(bnet, 5, [.5  .33333 .4  .33333 .1  .33333]);
engine = jtree_inf_engine(bnet);

Here variable 1 is my desired output variable, set to initially assign a .5 probability to either output class.

Variables 2-5 define CPDs for features I measure:

  • 2 is a cluster size, ranging from 1 to a dozen or more
  • 3 is a ratio that will be a real value >= 1
  • 4 and 5 are standard deviation (real) values (X and Y scatter)

In order to classify a candidate cluster I break all of the feature measurements into 3-4 range brackets, like so:

...
    evidence = cell(1, 5);
    evidence{2} = sum(M > [0 2 6]);
    evidence{3} = sum(O > [0 1.57 2 3]);
    evidence{4} = sum(S(1) > [-Inf 1 2]);
    evidence{5} = sum(S(2) > [-Inf 0.4 0.8]);
    eng = enter_evidence(engine, evidence);
    marginals = marginal_nodes(eng, 1);
    e = marginals.T(1);
...

This actually works pretty well, considering I'm only guessing at range brackets and probability values. But I believe that what I should be using here is a gaussian_CPD. I think that a gaussian_CPD can learn both the optimal brackets and probabilities (as mean and covariance matrices and weights).

My problem is, I am not finding any simple examples of how the BNT gaussian_CPD class is used. How, for example, would I go about initializing a gaussian_CPD to approximately the same behavior as one of my tabular_CPD variables above?

回答1:

I eventually figured this out by experimenting with BNT at the MATLAB command prompt. Here is how I defined my classifier net using gaussian_CPD nodes:

DAG = false(5); DAG(1, 2:5) = true
bnet = mk_bnet(DAG, [2 1 1 2 1], 'discrete', 1);
bnet.CPD{1} = tabular_CPD(bnet, 1, 'prior_type', 'dirichlet');
for node = 2:5
   bnet.CPD{node} = gaussian_CPD(bnet, node);
end
bnet

DAG =

     0     1     1     1     1
     0     0     0     0     0
     0     0     0     0     0
     0     0     0     0     0
     0     0     0     0     0

bnet = 

               equiv_class: [1 2 3 4 5]
                    dnodes: 1
                  observed: []
                     names: {}
                    hidden: [1 2 3 4 5]
               hidden_bitv: [1 1 1 1 1]
                       dag: [5x5 logical]
                node_sizes: [2 1 1 2 1]
                    cnodes: [2 3 4 5]
                   parents: {[1x0 double]  [1]  [1]  [1]  [1]}
    members_of_equiv_class: {[1]  [2]  [3]  [4]  [5]}
                       CPD: {[1x1 tabular_CPD]  [1x1 gaussian_CPD]  [1x1 gaussian_CPD]  [1x1 gaussian_CPD]  [1x1 gaussian_CPD]}
             rep_of_eclass: [1 2 3 4 5]
                     order: [1 5 4 3 2]

To train it, I used my original classifier to help me label a set of 300 samples, then simply ran 2/3rds of them through the training algorithm.

bnet = learn_params(bnet, lsamples);
CPD = struct(bnet.CPD{1}); % Peek inside CPD{1}
dispcpt(CPD.CPT);

1 : 0.6045 
2 : 0.3955

The output from dispcpt gives a rough idea of the breakdown between class assignments in the labeled samples in the training set.

To test the new classifier I ran the last 1/3rd of the results through both the original and the new Bayes nets. Here is the code I used for the new net:

engine = jtree_inf_engine(bnet);
evidence = cell(1, 5);
tresults = cell(3, length(tsamples));
tresults(3, :) = tsamples(1, :);
for i = 1:length(tsamples)
    evidence(2:5) = tsamples(2:5, i);
    marginal = marginal_nodes(enter_evidence(engine, evidence), 1);
    tresults{1, i} = find(marginal.T == max(marginal.T)); % Generic decision point
    tresults{2, i} = marginal.T(1);
end
tresults(:, 1:8)

ans = 

    [         2]    [     1]    [         2]    [         2]    [         2]    [     1]    [     1]    [     1]
    [1.8437e-10]    [0.9982]    [3.3710e-05]    [3.8349e-04]    [2.2995e-11]    [0.9997]    [0.9987]    [0.5116]
    [         2]    [     1]    [         2]    [         2]    [         2]    [     1]    [     1]    [     2]

Then to figure out if there was any improvement I plotted overlaid ROC diagrams. As it turned out, my original net did well enough that it was hard to tell for certain if the trained net using Gaussian CPDs did any better. Printing the areas under the ROC curves clarified that the new net did indeed perform slightly better. (base area is the original net, and area is the new one.)

conf = cell2mat(tresults(2,:));
hit = cell2mat(tresults(3,:)) == 1;
[~, ~, basearea] = plotROC(baseconf, basehit, 'r')
hold all;
[~, ~, area] = plotROC(conf, hit, 'b')
hold off;

basearea =

    0.9371

area =

    0.9555

I'm posting this here so that next time I need to do this, I'll be able to find an answer... Hopefully someone else might find it useful as well.



回答2:

I present a below a complete example that illustrates how to build a naive Bayes Net using the BNT Toolbox. I am using a subset of the cars dataset. It contains both discrete and continuous attributes.

Just for convenience, I am using a couple of functions that requires the Statistical toolbox.

We start by preparing the dataset:

%# load dataset
D = load('carsmall');

%# keep only features of interest
D = rmfield(D, {'Mfg','Horsepower','Displacement','Model'});

%# filter the rows to keep only two classes
idx = ismember(D.Origin, {'USA' 'Japan'});
D = structfun(@(x)x(idx,:), D, 'UniformOutput',false);
numInst = sum(idx);

%# replace missing values with mean
D.MPG(isnan(D.MPG)) = nanmean(D.MPG);

%# convert discrete attributes to numeric indices 1:mx
[D.Origin,~,gnOrigin] = grp2idx( cellstr(D.Origin) );
[D.Cylinders,~,gnCylinders] = grp2idx( D.Cylinders );
[D.Model_Year,~,gnModel_Year] = grp2idx( D.Model_Year );

Next we build our graphical model:

%# info about the nodes
nodeNames = fieldnames(D);
numNodes = numel(nodeNames);
node = [nodeNames num2cell((1:numNodes)')]';
node = struct(node{:});
dNodes = [node.Origin node.Cylinders node.Model_Year];
cNodes = [node.MPG node.Weight node.Acceleration];
depNodes = [node.MPG node.Cylinders node.Weight ...
            node.Acceleration node.Model_Year];

vals = cell(1,numNodes);
vals(dNodes) = cellfun(@(f) unique(D.(f)), nodeNames(dNodes), 'Uniform',false);
nodeSize = ones(1,numNodes);
nodeSize(dNodes) = cellfun(@numel, vals(dNodes));

%# DAG
dag = false(numNodes);
dag(node.Origin, depNodes) = true;

%# create naive bayes net
bnet = mk_bnet(dag, nodeSize, 'discrete',dNodes, 'names',nodeNames, ...
    'observed',depNodes);
for i=1:numel(dNodes)
    name = nodeNames{dNodes(i)};
    bnet.CPD{dNodes(i)} = tabular_CPD(bnet, node.(name), ...
        'prior_type','dirichlet');
end
for i=1:numel(cNodes)
    name = nodeNames{cNodes(i)};
    bnet.CPD{cNodes(i)} = gaussian_CPD(bnet, node.(name));
end

%# visualize the graph
[~,~,h] = draw_graph(bnet.dag, nodeNames);
hTxt = h(:,1); hNodes = h(:,2);
set(hTxt(node.Origin), 'FontWeight','bold', 'Interpreter','none')
set(hNodes(node.Origin), 'FaceColor','g')
set(hTxt(depNodes), 'Color','k', 'Interpreter','none')
set(hNodes(depNodes), 'FaceColor','y')

Now we split the data into training/testing:

%# build samples as cellarray
data = num2cell(cell2mat(struct2cell(D)')');

%# split train/test: 1/3 for testing, 2/3 for training
cv = cvpartition(D.Origin, 'HoldOut',1/3);
trainData = data(:,cv.training);
testData = data(:,cv.test);
testData(1,:) = {[]};    %# remove class

Finally we learn the parameters from the training set, and predict the class of the test data:

%# training
bnet = learn_params(bnet, trainData);

%# testing
prob = zeros(nodeSize(node.Origin), sum(cv.test));
engine = jtree_inf_engine(bnet);         %# Inference engine
for i=1:size(testData,2)
    [engine,loglik] = enter_evidence(engine, testData(:,i));
    marg = marginal_nodes(engine, node.Origin);
    prob(:,i) = marg.T;

end
[~,pred] = max(prob);
actual = D.Origin(cv.test)';

%# confusion matrix
predInd = full(sparse(1:numel(pred),pred,1));
actualInd = full(sparse(1:numel(actual),actual,1));
conffig(predInd, actualInd);             %# confmat

%# ROC plot and AUC
figure
[~,~,auc] = plotROC(max(prob), pred==actual, 'b')
title(sprintf('Area Under the Curve = %g',auc))
set(findobj(gca, 'type','line'), 'LineWidth',2)

The results:

and we can extract the CPT and mean/sigma at each node:

cellfun(@(x)dispcpt(struct(x).CPT), bnet.CPD(dNodes), 'Uniform',false)
celldisp(cellfun(@(x)struct(x).mean, bnet.CPD(cNodes), 'Uniform',false))
celldisp(cellfun(@(x)struct(x).cov, bnet.CPD(cNodes), 'Uniform',false))