I am trying to implement a naive Bayes classifier using BNT and MATLAB. So far I have been sticking with simple tabular_CPD variables and "guesstimating" the probabilities for each variable. My prototype net so far consists of the following:
DAG = false(5);
DAG(1, 2:5) = true;
bnet = mk_bnet(DAG, [2 3 4 3 3]);
bnet.CPD{1} = tabular_CPD(bnet, 1, [.5 .5]);
bnet.CPD{2} = tabular_CPD(bnet, 2, [.1 .345 .45 .355 .45 .3]);
bnet.CPD{3} = tabular_CPD(bnet, 3, [.2 .02 .59 .2 .2 .39 .01 .39]);
bnet.CPD{4} = tabular_CPD(bnet, 4, [.4 .33333 .5 .33333 .1 .33333]);
bnet.CPD{5} = tabular_CPD(bnet, 5, [.5 .33333 .4 .33333 .1 .33333]);
engine = jtree_inf_engine(bnet);
Here variable 1 is my desired output variable, initially set to assign a probability of 0.5 to either output class.
Variables 2-5 define CPDs for the features I measure:
- 2 is a cluster size, ranging from 1 to a dozen or more
- 3 is a ratio that will be a real value >= 1
- 4 and 5 are standard deviation (real) values (X and Y scatter)
To classify a candidate cluster, I break all of the feature measurements into 3-4 range brackets, like so:
...
evidence = cell(1, 5);
evidence{2} = sum(M > [0 2 6]);
evidence{3} = sum(O > [0 1.57 2 3]);
evidence{4} = sum(S(1) > [-Inf 1 2]);
evidence{5} = sum(S(2) > [-Inf 0.4 0.8]);
eng = enter_evidence(engine, evidence);
marginals = marginal_nodes(eng, 1);
e = marginals.T(1);
...
This actually works pretty well, considering I am only guessing at the bracket ranges and the probability values. But I believe what I should be using here is a gaussian_CPD. I think a gaussian_CPD could learn both the optimal brackets and the likelihoods (i.e. the means, covariance matrices and weights).
My problem is that I have not found any simple examples of how the BNT gaussian_CPD class is used. How, for example, would I go about initializing a gaussian_CPD to roughly the same behavior as one of my tabular_CPD variables above?
I eventually figured this out by experimenting with BNT at the MATLAB command prompt. Here is how I defined my classifier net using gaussian_CPD nodes:
DAG = false(5); DAG(1, 2:5) = true
bnet = mk_bnet(DAG, [2 1 1 2 1], 'discrete', 1);
bnet.CPD{1} = tabular_CPD(bnet, 1, 'prior_type', 'dirichlet');
for node = 2:5
    bnet.CPD{node} = gaussian_CPD(bnet, node);
end
bnet
DAG =
0 1 1 1 1
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
bnet =
equiv_class: [1 2 3 4 5]
dnodes: 1
observed: []
names: {}
hidden: [1 2 3 4 5]
hidden_bitv: [1 1 1 1 1]
dag: [5x5 logical]
node_sizes: [2 1 1 2 1]
cnodes: [2 3 4 5]
parents: {[1x0 double] [1] [1] [1] [1]}
members_of_equiv_class: {[1] [2] [3] [4] [5]}
CPD: {[1x1 tabular_CPD] [1x1 gaussian_CPD] [1x1 gaussian_CPD] [1x1 gaussian_CPD] [1x1 gaussian_CPD]}
rep_of_eclass: [1 2 3 4 5]
order: [1 5 4 3 2]
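As an aside, and closer to what the question originally asked: a gaussian_CPD does not have to start from the random defaults above. The constructor also accepts explicit 'mean' and 'cov' arguments, with one entry per configuration of the discrete parent. A minimal sketch (the numbers here are illustrative guesses only, standing in for the kind of hand-tuned values used in the tabular net):
% Illustrative only: seed node 2 (cluster size) with a guessed mean and
% variance for each of the two output classes, instead of random defaults
bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', [2 6], 'cov', [1 4]);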
To train it, I used my original classifier to help me label a set of 300 samples, and then simply ran two-thirds of them through the training algorithm.
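learn_params expects fully observed cases as a nodes-by-cases cell array, where column m holds the values of all five nodes for sample m. A rough sketch of how lsamples might be assembled (only the name lsamples comes from my code below; the labels vector and the per-sample M, O, S arrays are hypothetical stand-ins for the raw feature values from the question):
% Hypothetical layout: one row per node, one column per labeled sample
nTrain = 200;                      % roughly 2/3 of the 300 labeled samples
lsamples = cell(5, nTrain);
for i = 1:nTrain
    lsamples{1, i} = labels(i);    % class label, 1 or 2
    lsamples{2, i} = M(i);         % cluster size for sample i
    lsamples{3, i} = O(i);         % ratio
    lsamples{4, i} = S(1, i);      % X scatter std dev
    lsamples{5, i} = S(2, i);      % Y scatter std dev
end
Note that, unlike the tabular net, the raw measurements go in directly; no bracketing is needed for the continuous nodes. With lsamples in that shape, training is a single call: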
bnet = learn_params(bnet, lsamples);
CPD = struct(bnet.CPD{1}); % Peek inside CPD{1}
dispcpt(CPD.CPT);
1 : 0.6045
2 : 0.3955
The output from dispcpt gives a rough idea of the breakdown of class assignments among the labeled samples in the training set.
To test the new classifier, I ran the last third of the samples through both the original and the new Bayes nets. Here is the code I used for the new net:
engine = jtree_inf_engine(bnet);
evidence = cell(1, 5);
tresults = cell(3, length(tsamples));
tresults(3, :) = tsamples(1, :);
for i = 1:length(tsamples)
    evidence(2:5) = tsamples(2:5, i);
    marginal = marginal_nodes(enter_evidence(engine, evidence), 1);
    tresults{1, i} = find(marginal.T == max(marginal.T)); % Generic decision point
    tresults{2, i} = marginal.T(1);
end
tresults(:, 1:8)
ans =
[ 2] [ 1] [ 2] [ 2] [ 2] [ 1] [ 1] [ 1]
[1.8437e-10] [0.9982] [3.3710e-05] [3.8349e-04] [2.2995e-11] [0.9997] [0.9987] [0.5116]
[ 2] [ 1] [ 2] [ 2] [ 2] [ 1] [ 1] [ 2]
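As a quick sanity check alongside the ROC comparison below, the raw accuracy on the held-out third can be read straight out of tresults (a small sketch; it assumes each tresults{1,i} is a scalar, i.e. no ties in marginal.T):
pred     = cell2mat(tresults(1, :));   % predicted class per test sample
actual   = cell2mat(tresults(3, :));   % labels carried over from tsamples
accuracy = mean(pred == actual)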
Then, to figure out whether there was any improvement, I plotted overlaid ROC diagrams. As it turned out, my original net did well enough that it was hard to tell for certain whether the one trained with Gaussian CPDs performed any better. Printing the areas under the ROC curves clarified that the new net does indeed perform slightly better. (basearea is the original net, and area is the new one.)
conf = cell2mat(tresults(2,:));
hit = cell2mat(tresults(3,:)) == 1;
% baseconf and basehit were built the same way from the original tabular net's test results
[~, ~, basearea] = plotROC(baseconf, basehit, 'r')
hold all;
[~, ~, area] = plotROC(conf, hit, 'b')
hold off;
basearea =
0.9371
area =
0.9555
I'm posting this here so that the next time I need to do this, I'll be able to find an answer... and hopefully someone else will find it useful as well.
Below I present a complete example that illustrates how to build a naive Bayes net using the BNT toolbox. I am using a subset of the carsmall dataset, which contains both discrete and continuous attributes.
Just for convenience, I use a couple of functions that require the Statistics Toolbox.
We start by preparing the dataset:
%# load dataset
D = load('carsmall');
%# keep only features of interest
D = rmfield(D, {'Mfg','Horsepower','Displacement','Model'});
%# filter the rows to keep only two classes
idx = ismember(D.Origin, {'USA' 'Japan'});
D = structfun(@(x)x(idx,:), D, 'UniformOutput',false);
numInst = sum(idx);
%# replace missing values with mean
D.MPG(isnan(D.MPG)) = nanmean(D.MPG);
%# convert discrete attributes to numeric indices 1:mx
[D.Origin,~,gnOrigin] = grp2idx( cellstr(D.Origin) );
[D.Cylinders,~,gnCylinders] = grp2idx( D.Cylinders );
[D.Model_Year,~,gnModel_Year] = grp2idx( D.Model_Year );
Next, we build our graphical model:
%# info about the nodes
nodeNames = fieldnames(D);
numNodes = numel(nodeNames);
node = [nodeNames num2cell((1:numNodes)')]';
node = struct(node{:});
dNodes = [node.Origin node.Cylinders node.Model_Year];
cNodes = [node.MPG node.Weight node.Acceleration];
depNodes = [node.MPG node.Cylinders node.Weight ...
    node.Acceleration node.Model_Year];
vals = cell(1,numNodes);
vals(dNodes) = cellfun(@(f) unique(D.(f)), nodeNames(dNodes), 'Uniform',false);
nodeSize = ones(1,numNodes);
nodeSize(dNodes) = cellfun(@numel, vals(dNodes));
%# DAG
dag = false(numNodes);
dag(node.Origin, depNodes) = true;
%# create naive bayes net
bnet = mk_bnet(dag, nodeSize, 'discrete',dNodes, 'names',nodeNames, ...
    'observed',depNodes);
for i=1:numel(dNodes)
    name = nodeNames{dNodes(i)};
    bnet.CPD{dNodes(i)} = tabular_CPD(bnet, node.(name), ...
        'prior_type','dirichlet');
end
for i=1:numel(cNodes)
    name = nodeNames{cNodes(i)};
    bnet.CPD{cNodes(i)} = gaussian_CPD(bnet, node.(name));
end
%# visualize the graph
[~,~,h] = draw_graph(bnet.dag, nodeNames);
hTxt = h(:,1); hNodes = h(:,2);
set(hTxt(node.Origin), 'FontWeight','bold', 'Interpreter','none')
set(hNodes(node.Origin), 'FaceColor','g')
set(hTxt(depNodes), 'Color','k', 'Interpreter','none')
set(hNodes(depNodes), 'FaceColor','y')
Now we split the data into training and testing sets:
%# build samples as a cellarray (numNodes-by-numInst, one column per instance)
data = num2cell(cell2mat(struct2cell(D)')');
%# split train/test: 1/3 for testing, 2/3 for training
cv = cvpartition(D.Origin, 'HoldOut',1/3);
trainData = data(:,cv.training);
testData = data(:,cv.test);
testData(1,:) = {[]}; %# remove class
Finally, we learn the parameters from the training set, and predict the class of the test data:
%# training
bnet = learn_params(bnet, trainData);
%# testing
prob = zeros(nodeSize(node.Origin), sum(cv.test));
engine = jtree_inf_engine(bnet); %# Inference engine
for i=1:size(testData,2)
    [engine,loglik] = enter_evidence(engine, testData(:,i));
    marg = marginal_nodes(engine, node.Origin);
    prob(:,i) = marg.T;
end
[~,pred] = max(prob);
actual = D.Origin(cv.test)';
%# confusion matrix
predInd = full(sparse(1:numel(pred),pred,1));
actualInd = full(sparse(1:numel(actual),actual,1));
conffig(predInd, actualInd); %# confmat
%# ROC plot and AUC
figure
[~,~,auc] = plotROC(max(prob), pred==actual, 'b')
title(sprintf('Area Under the Curve = %g',auc))
set(findobj(gca, 'type','line'), 'LineWidth',2)
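For a single summary number next to the confusion matrix and the AUC, the overall accuracy on the test fold can be computed directly from the pred and actual vectors defined above (a one-line sketch):
accuracy = mean(pred == actual)    %# fraction of test instances classified correctly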
The resulting confusion matrix and ROC plot are produced by the code above.
We can also extract the CPTs and the means/sigmas at each node:
cellfun(@(x)dispcpt(struct(x).CPT), bnet.CPD(dNodes), 'Uniform',false)
celldisp(cellfun(@(x)struct(x).mean, bnet.CPD(cNodes), 'Uniform',false))
celldisp(cellfun(@(x)struct(x).cov, bnet.CPD(cNodes), 'Uniform',false))
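The group names returned by grp2idx during data preparation can be used to map the numeric indices in these tables back to the original labels (a trivial sketch using the variables defined earlier):
gnOrigin       %# class labels in the index order used for node.Origin
gnCylinders    %# category labels for the Cylinders node
gnModel_Year   %# category labels for the Model_Year node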