I am using the support vector classifier (SVC) from sklearn on the Iris dataset. When I call decision_function
it returns negative values. But every sample in the test dataset is assigned the correct class after classification. I thought decision_function should return a positive value when the sample is an inlier and a negative value when the sample is an outlier. Where am I wrong?
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data[:,:]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3,
random_state=0)
clf = SVC(probability=True)
print(clf.fit(X_train,y_train).decision_function(X_test))
print(clf.predict(X_test))
print(y_test)
Here is the output:
[[-0.76231668 -1.03439531 -1.40331645]
[-1.18273287 -0.64851109 1.50296097]
[ 1.10803774 1.05572833 0.12956269]
[-0.47070432 -1.08920859 -1.4647051 ]
[ 1.18767563 1.12670665 0.21993744]
[-0.48277866 -0.98796232 -1.83186272]
[ 1.25020033 1.13721691 0.15514536]
[-1.07351583 -0.84997114 0.82303659]
[-1.04709616 -0.85739411 0.64601611]
[-1.23148923 -0.69072989 1.67459938]
[-0.77524787 -1.00939817 -1.08441968]
[-1.12212245 -0.82394879 1.11615504]
[-1.14646662 -0.91238712 0.80454974]
[-1.13632316 -0.8812114 0.80171542]
[-1.14881866 -0.95169643 0.61906248]
[ 1.15821271 1.10902205 0.22195304]
[-1.19311709 -0.93149873 0.78649126]
[-1.21653084 -0.90953622 0.78904491]
[ 1.16829526 1.12102515 0.20604678]
[ 1.18446364 1.1080255 0.15199149]
[-0.93911991 -1.08150089 -0.8026332 ]
[-1.15462733 -0.95603159 0.5713605 ]
[ 0.93278883 0.99763184 0.34033663]
[ 1.10999556 1.04596018 0.14791409]
[-1.07285663 -1.01864255 -0.10701465]
[ 1.21200422 1.01284263 0.0416991 ]
[ 0.9462457 1.01076579 0.36620915]
[-1.2108146 -0.79124775 1.43264808]
[-1.02747495 -0.25741977 1.13056021]
...
[ 1.16066886 1.11212424 0.22506538]]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
2 1 1 2 0 2 0 0]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
1 1 1 2 0 2 0 0]
Christopher is correct, but he assumes OvR here. You are actually using the OvO scheme without noticing it!
In OvO prediction, every pairwise classifier casts a vote for one of its two classes, and the class that collects the most votes wins; ties are broken with the decision function values. An example is sketched below.
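Here is a minimal sketch of that voting (my own sketch, assuming scikit-learn's OvO column order, i.e. the pairs (0, 1), (0, 2), (1, 2), where a positive value favours the first class of the pair):
import numpy as np
from itertools import combinations
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.3, random_state=0)

# ask explicitly for the raw one-vs-one values (one column per class pair)
clf = SVC(decision_function_shape='ovo').fit(X_train, y_train)
d = clf.decision_function(X_test)          # shape: (n_samples, 3)

pairs = list(combinations(range(3), 2))    # [(0, 1), (0, 2), (1, 2)]
votes = np.zeros((len(X_test), 3))
for col, (a, b) in enumerate(pairs):
    votes[:, a] += d[:, col] > 0           # positive value -> vote for class a
    votes[:, b] += d[:, col] < 0           # negative value -> vote for class b

# should print True unless a vote tie occurs (sklearn breaks ties with the values)
print(np.array_equal(votes.argmax(axis=1), clf.predict(X_test)))
Note that for three classes both the OvO and the OvR layout happen to have three columns, so the shape of the output alone does not tell you which scheme produced your values.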
You need to consider the decision_function and the prediction separately. The decision value is the distance from the hyperplane to your sample, so by looking at its sign you can tell on which side of the hyperplane the sample lies. Negative values are therefore perfectly fine and simply indicate the negative class ("the other side of the hyperplane").
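A small sketch of this in the binary case (restricting iris to classes 0 and 1 here, just to get a single hyperplane):
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
mask = iris.target < 2                      # keep only classes 0 and 1
X2, y2 = iris.data[mask], iris.target[mask]

clf2 = SVC().fit(X2, y2)
d = clf2.decision_function(X2)              # one signed value per sample

# negative value -> class 0, positive value -> class 1
print(np.array_equal((d > 0).astype(int), clf2.predict(X2)))  # should print True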
With the iris dataset you have a multi-class problem. As the SVM is a binary classifier, there is no inherent multi-class classification. Two approaches are the "one-vs-rest" (OvR) and "one-vs-one" (OvO) methods, which construct a multi-class classifier from the binary "units".
One-vs-one
Once you know OvR (see the section below), OvO is not that much harder to grasp. You basically construct one classifier for every pair of classes (A, B). In your case: 0 vs 1, 0 vs 2, 1 vs 2.
Note: The values for (A, B) and (B, A) can be obtained from a single binary classifier. You only change what is considered the positive class, and thus you have to invert the sign.
Doing this for, e.g., the second sample of your test set (decision values [-1.18, -0.65, 1.50]) gives you a matrix:
           vs 0     vs 1     vs 2
class 0      -     -1.18    -0.65
class 1    1.18      -       1.50
class 2    0.65    -1.50       -
Read this as follows: the entry in row A, column B is the decision function value when class A competes against class B.
In order to extract a result a vote is performed. In the basic form you can imagine this as a single vote that each classifier can give: Yes or No. This could lead to draws, so we use the whole decision function values instead.
Summing over each row gives you again one value per class, here roughly [-1.83, 2.69, -0.85]. Now apply arg max
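The same aggregation as a small code sketch, using the values of that second test sample copied from your output above:
import numpy as np

# OvO decision values of the second test sample: pairs (0 vs 1), (0 vs 2), (1 vs 2)
d01, d02, d12 = -1.18273287, -0.64851109, 1.50296097

# antisymmetric matrix M[A, B] = decision value when class A competes against class B
M = np.array([[ 0.0,   d01,  d02],
              [-d01,   0.0,  d12],
              [-d02,  -d12,  0.0]])

scores = M.sum(axis=1)   # sum each row -> one aggregated value per class
print(scores)            # roughly [-1.83  2.69 -0.85]
print(scores.argmax())   # 1, the class predicted for this sample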
and it matches your prediction (class 1 for this sample).
One-vs-rest
I keep this section to avoid further confusion. The scikit-learn SVC classifier (libsvm) has a decision_function_shape parameter, which deceived me into thinking it was OvR (I am using liblinear most of the time). For a real OvR response you get one value from the decision function per classifier (i.e. per class), e.g.
[-1.18273287 -0.64851109  1.50296097]
Now to obtain a prediction from this you could just apply arg max, which would return the last index, with a value of 1.50296097. From here on the decision function's value is not needed anymore (for this single prediction). That's why you noticed that your predictions are fine.
However, you also specified probability=True, which takes the value of the decision function and passes it through a sigmoid function. The same principle as above applies, but now you also have confidence values (I prefer this term over probabilities, since it only describes the distance to the hyperplane) between 0 and 1.
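For completeness, a quick sketch of reading those confidence values (the exact numbers depend on the internal cross-validation used by Platt scaling, so I do not list them here):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.3, random_state=0)

clf = SVC(probability=True).fit(X_train, y_train)
proba = clf.predict_proba(X_test)   # shape (n_samples, 3); each row sums to 1
print(proba[:3])
# note: these estimates come from Platt scaling with internal cross-validation,
# so their argmax can occasionally disagree with clf.predict(X_test)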
Edit: Oops, sascha is right. LibSVM uses one-vs-one (despite the shape of the decision function).