SVM Classification - minimum number of input sets

Posted 2019-02-07 09:52

Question:

I'm trying to build an app that detects which images on a web page are advertisements. Once I detect them, I won't allow them to be displayed on the client side.

From the help I got on this Stackoverflow question, I concluded that an SVM is the best approach for my goal.

So I have coded an SVM and SMO myself. The dataset, which I got from the UCI data repository, has 3280 instances (Link to Dataset); around 400 of them belong to the class representing advertisement images and the rest represent non-advertisement images.

Right now I'm taking the first 2800 input sets and training the SVM on them. But after looking at the accuracy rate, I realised that most of those 2800 input sets are from the non-advertisement class, so I'm getting very good accuracy for that class.

So what can I do here? Roughly how many input sets should I give the SVM for training, and how many of them from each class?

Thanks. Cheers. (I made a new question because the context is different from my previous question, Optimization of Neural Network input data.)


Thanks for the reply. I want to check whether I'm deriving the C values for the ad and non-ad classes correctly. Please give me feedback on this.

Or you can see the doc version here.

You can see the graph for y1 equal to y2 here

and for y1 not equal to y2 here

Answer 1:

There are two ways of going about this. One would be to balance the training data so it includes an equal number of advertisement and non-advertisement images. This could be done by either oversampling the 400 advertisement images or undersampling the thousands of non-advertisement images. Since training time can increase dramatically with the number of data points used, you should probably first try undersampling the non-advertisement images and create a training set with the 400 ad images and 400 randomly selected non-advertisements.
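As a rough sketch of the undersampling idea, the snippet below keeps every sample of the smallest class and randomly draws the same number from each larger class. The `undersample` helper, the +1/-1 labels and the toy feature vectors are placeholders for illustration, not part of the original dataset or code.

```python
import random


def undersample(samples, labels, seed=0):
    """Return a class-balanced subset by randomly dropping majority-class samples."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is cut down to the size of the smallest class.
    n_min = min(len(xs) for xs in by_class.values())

    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):
            balanced.append((x, y))

    # Shuffle so the classes are interleaved in the training set.
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


# Toy data standing in for the real feature vectors:
# 400 ads (label +1) and 2880 non-ads (label -1).
samples = [[i] for i in range(3280)]
labels = [1] * 400 + [-1] * 2880

X_bal, y_bal = undersample(samples, labels)
print(len(X_bal), y_bal.count(1), y_bal.count(-1))  # 800 400 400
```

Drawing at random from the whole set, rather than taking the first N rows, also avoids the problem in the question where the first 2800 rows were mostly one class.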

The other solution would be to use a weighted SVM, so that margin errors for the ad images are weighted more heavily than those for non-ads. In the libSVM package this is done with the -wi flag. From your description of the data, you could try weighting the ad images about 7 times more heavily than the non-ads.
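To show where the factor of about 7 comes from, here is a small sketch that derives per-class weights from the class counts (roughly 2880 non-ads to 400 ads) and formats them as libSVM `-wi` options. The `class_weights` helper, the +1/-1 labels and the file name `train.txt` are assumptions made for the example.

```python
def class_weights(labels):
    """Weight each class by (size of the largest class) / (size of this class)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    largest = max(counts.values())
    return {y: largest / n for y, n in counts.items()}


# Approximate class sizes from the question: 400 ads (+1), 2880 non-ads (-1).
labels = [1] * 400 + [-1] * 2880
weights = class_weights(labels)
print(weights)

# libSVM's svm-train takes one "-w<label> <weight>" option per class;
# the weight multiplies the C parameter for that class.
flags = " ".join(f"-w{y} {w:g}" for y, w in sorted(weights.items(), reverse=True))
command = f"svm-train {flags} train.txt"  # train.txt is a hypothetical file name
print(command)
```

The ratio is only a starting point; it is worth trying a few values around it and comparing per-class accuracy on held-out data.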



Answer 2:

The required size of your training set depends on the sparseness of the feature space. As far as I can see, you have not said which image features you have chosen to use. Before you can train, you need to convert each image into a vector of numbers (features) that describes the image, hopefully capturing the aspects that you care about.

Oh, and unless you are reimplementing SVM for sport, I'd recommend just using libsvm.
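If you do go with libsvm, each image's feature vector has to be written in its sparse text format: one line per sample, `label index:value ...`, with 1-based indices and zero-valued features left out. A minimal sketch follows; the `to_libsvm_line` helper and the feature values are made up for illustration.

```python
def to_libsvm_line(label, features):
    """Format one sample as a line of libsvm's sparse text format."""
    parts = [str(label)]
    # Indices start at 1; zero-valued features are simply omitted.
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical feature vector for one ad image (label 1).
line = to_libsvm_line(1, [60, 468, 7.8, 0, 1])
print(line)  # 1 1:60 2:468 3:7.8 5:1
```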