SVM classification with always high precision

2019-09-12 06:56发布

问题:

I have a binary classification problem and I'm trying to get precision-recall curve for my classifier. I use libsvm with RBF kernel and probability estimate option.

To get the curve I'm changing decision threshold from 0 to 1 with steps of 0.1. But on every run, I get high precision even if recall decreases with increasing threshold. My false positive rate seems always low compared to true positives.

My results are these:

Threshold: 0.1
TOTAL TP:393, FP:1, FN: 49
Precision:0.997462, Recall: 0.889140

Threshold: 0.2
TOTAL TP:393, FP:5, FN: 70
Precision:0.987437, Recall: 0.848812

Threshold: 0.3
TOTAL TP:354, FP:4, FN: 78
Precision:0.988827, Recall: 0.819444

Threshold: 0.4
TOTAL TP:377, FP:9, FN: 104
Precision:0.976684, Recall: 0.783784

Threshold: 0.5
TOTAL TP:377, FP:5, FN: 120
Precision:0.986911, Recall: 0.758551

Threshold: 0.6
TOTAL TP:340, FP:4, FN: 144
Precision:0.988372, Recall: 0.702479

Threshold: 0.7
TOTAL TP:316, FP:5, FN: 166
Precision:0.984424, Recall: 0.655602

Threshold: 0.8
TOTAL TP:253, FP:2, FN: 227
Precision:0.992157, Recall: 0.527083

Threshold: 0.9
TOTAL TP:167, FP:2, FN: 354
Precision:0.988166, Recall: 0.320537

Does this mean I have a good classifier or I have a fundamental mistake somewhere?

回答1:

One of the reasons for this could be while training the data you have lot of negative samples than positive ones. Hence, almost all the examples are being classified as negative samples except the few. Hence, you get high precision i.e. less false positives and low recall i.e. more false negatives.

Edit:

Now that we know you have more negative samples than positive ones:

If you look at the results, as and when you increase the threshold the number of False negatives are increasing i.e. your positive samples are classified as negative ones, which is not a good thing. Again, it depends on your problem, some problems will prefer high precision over recall, some will prefer high recall over precision. If you want both precision and recall to be high, you might need to resolve class imbalance, by trying oversampling (repeating positive samples so that ratio becomes 1:1) or undersampling (taking random negative samples in proportion with positive samples) or something more sophisticated like SMOTE algorithm (which adds similar positive samples).

Also, I am sure there must be "class_weight" parameter in the classifier, which gives more importance to error in the class where there are less training examples. You might want to try giving more weight to positive class than negative ones.



回答2:

Having a high precision can be that your data has a pattern that your model seems to grasp easily so it's a good classifier.
Maybe your measures are incorrectly computed or the most probable : your model is overfitting. That means that your model is not learning but rather memorizing.
This can be produced by testing your model on your training set.