I'm new to R and I'm using the e1071 package for SVM classification.
I used the following code:
data <- loadNumerical()
model <- svm(data[,-ncol(data)], data[,ncol(data)], gamma=10)
print(predict(model, data[c(1:20),-ncol(data)]))
The loadNumerical function loads the data, which have the following form (the first 8 columns are inputs and the last column is the class label):
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
1 39 1 -1 43 -1 1 0 0.9050497 0
2 23 -1 -1 30 -1 -1 0 1.6624974 1
3 50 -1 -1 49 1 1 2 1.5571429 0
4 46 -1 1 19 -1 -1 0 1.3523685 0
5 36 1 1 29 -1 1 1 1.3812029 1
6 27 -1 -1 19 1 1 0 1.9403649 0
7 36 -1 -1 25 -1 1 0 2.3360004 0
8 41 1 1 23 1 -1 1 2.4899738 0
9 21 -1 -1 18 1 -1 2 1.2989637 1
10 39 -1 1 21 -1 -1 1 1.6121595 0
The number of rows in the data is 500.
As shown in the code above, I tested the first 20 rows for prediction. And the output is:
1 2 3 4 5 6 7
0.04906014 0.88230392 0.04910760 0.04910719 0.87302217 0.04898187 0.04909523
8 9 10 11 12 13 14
0.04909199 0.87224979 0.04913189 0.04893709 0.87812890 0.04909588 0.04910999
15 16 17 18 19 20
0.89837037 0.04903778 0.04914173 0.04897789 0.87572114 0.87001066
I can tell intuitively from the result that a value close to 0 means class 0, and a value close to 1 means class 1.
But my question is how to interpret the result precisely: is there a threshold s such that values below s are classified as 0 and values above s are classified as 1?
If such an s exists, how can I derive it?
Very broadly speaking, with classifiers like this the predicted value for a binary response variable can be thought of as the probability that the observation belongs to class 1. In this case your classes are actually labeled 0/1; in other cases you'd need to know which class the function treats as 1 and which as 0. R often sorts the labels of factors alphabetically, so the last one would be class 1.
So the most common thing people do is use 0.5 as a cutoff. But I should warn you that there is plenty of math behind that decision and the particulars of your modeling circumstances can necessitate a different cutoff value. Using 0.5 as the cutoff is often the best thing to do, but SVMs are fairly complicated beasts; I would recommend that you do some reading on SVMs and classification theory in general before you start trying to apply them to real data.
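As a small illustration of the 0.5 cutoff, here is a sketch that applies it to the 20 predicted values printed in the question. The names pred and cls are only illustrative, and 0.5 is the conventional default rather than a value derived from the data.

```r
# Predicted values copied from the output shown in the question
pred <- c(0.04906014, 0.88230392, 0.04910760, 0.04910719, 0.87302217,
          0.04898187, 0.04909523, 0.04909199, 0.87224979, 0.04913189,
          0.04893709, 0.87812890, 0.04909588, 0.04910999, 0.89837037,
          0.04903778, 0.04914173, 0.04897789, 0.87572114, 0.87001066)

# Values above the cutoff become class 1, everything else class 0
cutoff <- 0.5
cls <- as.integer(pred > cutoff)

which(cls == 1)  # rows assigned to class 1: 2 5 9 12 15 19 20
```

For the first 10 rows, whose labels are shown in the question, this puts rows 2, 5 and 9 in class 1, which matches the last column of the data.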
My favorite reference is The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman.
Since your outcome variable is numeric, svm uses the regression formulation of SVM. I think you want the classification formulation. You can change this by either coercing your outcome into a factor or setting type="C-classification".
Regression:
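A minimal sketch of the regression case, using made-up data as a stand-in for the output of loadNumerical(). The random matrix, the rule that generates the 0/1 label, and the names x, y, model_reg and pred_reg are all assumptions for illustration.

```r
library(e1071)

set.seed(1)
# Synthetic stand-in: 8 numeric inputs plus a 0/1 label in column 9
x <- matrix(rnorm(200 * 8), ncol = 8)
y <- as.numeric(x[, 1] + x[, 2] > 0)
data <- cbind(x, y)

# The outcome is numeric, so svm() fits a regression model
model_reg <- svm(data[, -ncol(data)], data[, ncol(data)], gamma = 10)

# Predictions are real-valued, not class labels
pred_reg <- predict(model_reg, data[1:20, -ncol(data)])
print(pred_reg)
```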
Classification:
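A minimal sketch of the classification case on the same kind of made-up data (again a stand-in for loadNumerical(), with illustrative names). The only change from the regression call is that the outcome is wrapped in factor().

```r
library(e1071)

set.seed(1)
# Synthetic stand-in: 8 numeric inputs plus a 0/1 label in column 9
x <- matrix(rnorm(200 * 8), ncol = 8)
y <- as.numeric(x[, 1] + x[, 2] > 0)
data <- cbind(x, y)

# Coercing the outcome to a factor makes svm() fit a classifier
model_cls <- svm(data[, -ncol(data)], factor(data[, ncol(data)]), gamma = 10)

# Predictions are now class labels (a factor with levels "0" and "1")
pred_cls <- predict(model_cls, data[1:20, -ncol(data)])
print(pred_cls)
```

With class labels as output, no threshold is needed.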
Also, if you want probabilities as your prediction rather than just the raw classification, you can get them by fitting with probability=TRUE and passing probability=TRUE to predict as well.
With Probabilities:
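A minimal sketch of the probability case, again on made-up data standing in for loadNumerical() (all names are illustrative, and gamma is left at its default here). The class probabilities come back as an attribute of the prediction.

```r
library(e1071)

set.seed(1)
# Synthetic stand-in: 8 numeric inputs plus a 0/1 label in column 9
x <- matrix(rnorm(200 * 8), ncol = 8)
y <- as.numeric(x[, 1] + x[, 2] > 0)
data <- cbind(x, y)

# probability = TRUE must be set when fitting ...
model_prob <- svm(data[, -ncol(data)], factor(data[, ncol(data)]),
                  probability = TRUE)

# ... and again when predicting
pred_prob <- predict(model_prob, data[1:20, -ncol(data)], probability = TRUE)

# One column per class, named after the factor levels
probs <- attr(pred_prob, "probabilities")
print(probs[, "1"])  # estimated probability of class 1 for each row
```

Selecting the column by name avoids relying on the column order.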