When testing the naive Bayes classifier in MATLAB I get misclassifications even though I trained and tested on the same sample data. I was wondering if my code is correct and if someone could help explain why this happens?
%% dimensionality reduction
columns = 6;
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# keep all columns of U (after svds it only has 'columns' columns)
indY = 1:columns;
%# filter data
data = U(indX,indY);
%% apply normalization method to every cell
data = zscore(data);
% create a training set identical to the data sample
training_data = data;
% match the class labels to the corresponding rows
target_class = classlabels(indX,:);
% classify the same data sample to check whether naive Bayes reproduces the labels
class = classify(data, training_data, target_class, 'diaglinear');
confusionmat(target_class, class)
Here is an example:
Notice it got ipsweep, teardrop and back mixed up with normal traffic. I haven't got to the stage of classifying unseen data yet; I just wanted to test whether it would correctly classify the same data it was trained on.
The confusion matrix output:
ans =
   537     0     0     0     0     0     0     1     0
     0   224     0     0     0     1     0     1     0
     0     0    91    79     0    17    24     4     0
     0     0     0     8     0     0     2     0     0
     0     0     0     0     3     0     0     0     0
     0     0     0     0     0     1     0     0     0
     0     0     0     0     0     0     2     0     0
     0     0     0     0     0     0     0     3     0
     0     0     0     0     0     1     0     0     1
I have no clue what this output actually means, and I may well have got something wrong in my code, but I thought I would test it to see what it produces.
You are using a classifier on data of reduced dimensionality. A classifier is expected to be somewhat imprecise because it has to generalize. In the dimensionality reduction stage you are losing information, which also reduces classification performance.
Don't expect perfect performance even on the training set; that would be a bad case of over-fitting.
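A quick sanity check for how much the reduction discards is to compare the energy of the retained singular values against the total (a minimal sketch, assuming fulldata is the raw feature matrix from your question):

% Hypothetical check: fraction of the data's total energy (squared Frobenius
% norm) captured by the k singular values kept by svds.
k = 6;
[U, S, V] = svds(fulldata, k);                          % truncated SVD, k components
retained = sum(diag(S).^2) / norm(fulldata, 'fro')^2    % close to 1 means little information lost

If retained is well below 1, a noticeable part of the structure the classifier could have used is gone before training even starts.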
As for reading the confusion matrix:
C(3,4)=79
means nothing more than that 79 data points whose true class is 3 were classified as class 4. The complete matrix says that your classifier works well for classes 1 and 2 but has problems with class 3. The remaining classes have almost no data, so it is difficult to judge how well the classifier works for them.
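Per-class recall can also be read off the matrix directly; a minimal sketch, assuming C holds the output of confusionmat(target_class, class):

C = confusionmat(target_class, class);   % rows = true class, columns = predicted class
recall = diag(C) ./ sum(C, 2)            % fraction of each true class predicted correctly

For your output that gives roughly 1.0 for the first two classes and about 0.42 for the third, which matches the pattern described above.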