LibSVM turns all my training vectors into support vectors

Posted 2019-05-01 10:25

Question:

I am trying to use SVM for News article classification.

I created a table whose rows are the features (the unique words found in the documents) and built weight vectors against these features: if an article contains a word from the feature table, that position is set to 1, otherwise 0.

Example of a generated training sample:

1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 17:1 18:1 19:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1 27:1 28:1 29:1 30:1

As this is the first document, all the features are present.
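For reference, here is a minimal Python sketch (not the actual project code) of producing lines in this LibSVM sparse format; the `vocab` mapping and the toy words are assumed stand-ins for the feature table and articles described above:

```python
# Minimal sketch: emit binary presence features in LibSVM's sparse format.
# `vocab` (word -> 1-based feature index) is an assumed stand-in for the
# feature-vector table described above.
def to_libsvm_line(label, words, vocab):
    indices = sorted({vocab[w] for w in words if w in vocab})
    return f"{label} " + " ".join(f"{i}:1" for i in indices)

vocab = {"stocks": 1, "market": 2, "goal": 3}            # toy feature table
print(to_libsvm_line(1, ["market", "stocks"], vocab))    # -> "1 1:1 2:1"
```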

I am using 1, 0 as class labels.

I am using svm.Net for classification.

I supplied 300 manually classified weight vectors as training data, and the generated model is taking all of them as support vectors, which is surely overfitting.

My total number of features (unique words, i.e. the row count of the feature-vector DB table) is 7610.

What could be the reason?

Because of this overfitting, my project is now in pretty bad shape: it classifies every available article as positive.

In LibSVM, is there any restriction on the class labels for binary classification?

I am using 0 and 1 instead of -1 and +1. Is that a problem?

Answer 1:

As pointed out, a parameter search is probably a good idea before doing anything else.

I would also investigate the different kernels available to you. The fact that your input data is binary might be problematic for the RBF kernel (or might render its usage sub-optimal compared to another kernel). I have no idea which kernel would be better suited, though. Try a linear kernel, and look around for more suggestions/ideas :)
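As a rough illustration (not from the original answer), here is a sketch comparing a linear and an RBF kernel via scikit-learn's SVC, which wraps LibSVM; the random X and y are stand-ins for the real 0/1 feature matrix and labels:

```python
# Sketch: cross-validate a linear vs. an RBF kernel on binary features.
# X and y are random stand-ins for the question's 0/1 data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = (rng.random((300, 500)) < 0.05).astype(float)  # 0/1 presence features
y = rng.integers(0, 2, 300)

for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())
```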

For more information and perhaps better answers, look on stats.stackexchange.com.



Answer 2:

You need to do some type of parameter search. Also, if the classes are unbalanced, the classifier might get artificially high accuracies without doing much. This guide is good at teaching basic, practical things; you should probably read it.
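One hedged sketch of such a search, using scikit-learn's GridSearchCV over a LibSVM-backed SVC; class_weight="balanced" and the f1 metric are one way to guard against the imbalance issue, and X, y are again stand-in data:

```python
# Sketch: grid search over C and gamma, with class weighting and an
# imbalance-aware metric. X and y are random stand-ins for the real data.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = (rng.random((300, 500)) < 0.05).astype(float)
y = rng.integers(0, 2, 300)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, "scale"]}
search = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"),
                      param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```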



Answer 3:

I would definitely try using -1 and +1 for your labels; that's the standard way to do it.

Also, how much data do you have? Since you're working in 7610-dimensional space, you could potentially have that many support vectors, where a different vector is "supporting" the hyperplane in each dimension.

With that many features, you might want to try some type of feature selection method like principal component analysis (PCA).
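A sketch of that idea, with one substitution: TruncatedSVD is used in place of plain PCA because it accepts sparse matrices directly (plain PCA would densify the 7610-column term matrix); the sparse X and labels y are stand-ins for the real data:

```python
# Sketch: reduce the 7610-dimensional term space before the SVM.
# TruncatedSVD stands in for PCA since it handles sparse input;
# X and y are random stand-ins for the real data.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X = sparse_random(300, 7610, density=0.005, format="csr", random_state=0)
X.data[:] = 1.0                                   # binary presence features
y = np.random.default_rng(0).integers(0, 2, 300)

model = make_pipeline(TruncatedSVD(n_components=100, random_state=0),
                      SVC(kernel="linear"))
model.fit(X, y)
print(model.score(X, y))
```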