I'm new to SVM. I used Libsvm for Matlab, and after a prediction phase I've got a decision values array. From SVM theory: each test record z is assigned as positive if
f(z)=1
where f(z) is defined as
f(z)=sign(w*z+b)
So how can I relate the decision value from the array for an instance z with f(z)?
Is the prediction based on decision value so: if dec_value>0 then z is positive otherwise z is negative?
Yes, you are correct, if f(z) is positive, then the instance belongs to class +1, if its negative it belongs to class -1. The value of f(z) is not interpretable.
While the function:
f(z) = sign(w*z+b)
looks like an equation for a hyperplane, it differs in that w is not a normal vector - its length is not 1, so the value of f(z) is not the distance from the hyperplane, this is why it is specified as sign(..), to make it clear the value is only used to determine which side of the hyperplane the instance falls on.
Some background:
The goal is to find the hyperplane which gives the maximum margin between the two classes:
So, the purpose is to maximize the margin, which is , therefore minimizing . Remember, usually when w is used to represent a hyperplane as the normal vector, is 1. That isn't the case here clearly, as there would be no optimization problem. Instead of keeping = 1 and varying the width of the margin, we've fixed the width of the margin to 2 and are allowing to vary in size instead.
This gives us the primal optimization problem (with soft margins):
This seems to be what you are referring to. However, this equation comes from basic soft maximum margin classifier, which is foundation of SVM. True SVM is formulated as a Lagrangian dual to allow the use of kernels. The neat thing about SVM is that when the above problem (and its constraints) are formulated in the Lagrangian, all the variables except for the lagrangian multipliers drop out, leaving us with the following problem:
Notice there is no w. The training points x (y are the labels, 1 or -1), now only appear together as a dot product, allowing us to employ the kernel trick to obtain a non-linear model.
But if we don't have w what is our decision function? It becomes a function of our support vectors and the lagrangian multipliers we found.
This is what libsvm produces and what it stores as the model you have trained. It stores the support vectors and the associated alphas. For linear SVM, you can obtain the primal w, this is explained here in the LibSVM FAQ, but it is not going to be what you get back automatically from LibSVM, and this can only be done for the linear kernel.
The value of the SVM decision function based on the lagrangian multipliers and support vectors should only be interpreted by its sign as well.
Reading the docs tells me that:
The third [return value] is a matrix containing decision values or probability
estimates (if '-b 1' is specified). If k is the number of classes in
training data, for decision values, each row includes results of
predicting k(k-1)/2 binary-class SVM's.
So for a two-class problems, what you get is a vector containing the decision values f(z)
, so this means everything belonging to the first class has d<0 and everything belonging to the second class has d>0.
In general: libsvm considers its first class to be the first label it gets and so on. So for having reliable results you'd need to sort your data first.
In the binary case this holds too: whatever labels you feed svmtrain, it will take the first it encounters as class 1, and the second as class -1. This is trivially verifiable by feeding it a trivial dataset:
Y = [-ones(100,1);ones(100,1)];
m = svmtrain(Y,Y); % train with all labels as data (never do this in practices, not the "all" part, not the training on labels part ;)
[a,b,c] = svmpredict(Y,Y,m); % of course this will give 100% accuracy.
b' % you can see that the first label will always have an internal representation of 1.
For multi-class classification this is different: it then contains k(k-1)/2 entries, corresponding to all one-against-all class scenarios for each pixel.
This means for eg. a 4 class problem you'd have 4*3/2 = 6 values for each sample:
[ f12(z) f13(z) f14(z) f23(z) f24(z) f34(z)]
Now how these function values map to classes through one-against-all I could not really deduce easily from looking at the code... But I guess you where mostly interested in the 2 class case anyhow, no?