When it comes to cascade classifiers (using Haar-like features), I always read that methods like AdaBoost are used to select the 'best' features for detection. However, this only works if there is some initial set of features to begin boosting with.
Given a 24x24 pixel image, there are 162,336 possible Haar features. I might be wrong here, but I don't think libraries like OpenCV initially test against all of these features.
So my question is: how are the initial features selected, or how are they generated? Is there any guideline on the initial number of features?
And if all 162,336 features are used initially, how are they generated?
I presume you're familiar with Viola and Jones' original work on this topic.
You start by manually choosing a feature type (e.g. Rectangle A). This gives you a mask with which you can train your weak classifiers. In order to avoid moving the mask pixel by pixel and retraining (which would take a huge amount of time without yielding any better accuracy), you can specify how far the feature moves in the x and y directions per trained weak classifier. The size of your jumps depends on your data size. The goal is for the mask to be able to move in and out of the detected object. The size of the feature can also be variable.
After you've trained multiple classifiers, each with its respective feature (i.e. mask position), you proceed with AdaBoost and cascade training as usual.
The number of features/weak classifiers is highly dependent on your data and experimental setup (i.e. also on the type of classifier you use). You'll need to test the parameters extensively to find out which types of features work best (rectangle/circle/tetris-like objects etc.). I worked on this 2 years ago, and it took us quite a long time to evaluate which features and feature-generation heuristics yielded the best results.
If you want to start somewhere, just take one of the 4 original Viola-Jones features and train a classifier applying it anchored at (0,0). Train the next classifier at (x,0), the next at (2x,0), ..., then (0,y), (0,2y), ..., (x,y), (x,2y), etc.
And see what happens. Most likely you'll find that it's OK to have fewer weak classifiers, i.e. you can increase the x/y step values that determine how the mask slides. You can also have the mask grow, or do other things to save time. The reason this "lazy" feature generation works is AdaBoost: as long as these features make the classifiers slightly better than random, AdaBoost will combine them into a meaningful classifier.
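As a rough illustration of this grid-based generation, here is a minimal Python sketch; the window size, feature size (fw, fh) and step values (step_x, step_y) are all hypothetical parameters, not taken from any library:

    # Minimal sketch: enumerate anchor positions for one feature type on a
    # 24x24 window, moving in steps of step_x/step_y instead of pixel by pixel.

    def grid_anchors(window=24, fw=8, fh=4, step_x=4, step_y=4):
        """Yield (x, y) anchors where a fw-by-fh feature mask fits in the window."""
        for y in range(0, window - fh + 1, step_y):
            for x in range(0, window - fw + 1, step_x):
                yield (x, y)

    # One weak classifier would be trained per anchor; larger steps mean
    # fewer weak classifiers and faster training.
    anchors = list(grid_anchors())
    print(len(anchors), "anchor positions:", anchors)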
From your question I understand that you want to know what the 162,336 features are.
Starting from the 4 original Viola-Jones features (http://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework), we can generate 162,336 features by varying the size of the 4 original features and their position on a 24x24 input image.
For example, consider one of the original features, which has two rectangles adjacent to each other horizontally.
Let us say each rectangle is 1x1 pixel, so the whole feature spans 2x1 pixels. If the feature is placed at (0,0) of the 24x24 image, that counts as one feature; move it horizontally by one pixel (to (1,0)) and it counts as a second feature, since its position has changed. You can move it horizontally up to (22,0), giving 23 positions. Similarly, moving along the vertical axis from (0,0) to (0,23) gives 24 positions. Covering every position on the image (e.g. (1,1), (1,2), ..., (22,23)) therefore yields 23*24 = 552 features.
Now suppose each rectangle is 2 pixels wide and 1 pixel high, so the feature spans 4x1 pixels. Starting at (0,0) and moving along the horizontal axis up to (20,0) as above gives 21 positions; since the height is unchanged, moving along the vertical axis from (0,0) to (0,23) still gives 24 positions. Covering every position on the image therefore yields 21*24 = 504 features.
Continuing in this way, if we increase the width of each rectangle by one pixel at a time (keeping the height at 1 pixel) and cover the complete image each time, so that the rectangle width goes from 1 to 12 pixels (the whole feature spanning up to 24 pixels), we get: no. of features = 24*(23+21+19+...+3+1).
Now, consider each rectangle 1 pixel wide and 2 pixels high. Starting at (0,0) and moving along the horizontal axis up to (22,0) gives 23 positions (the feature still spans 2 pixels in width); since its height is 2 pixels, moving along the vertical axis from (0,0) to (0,22) gives 23 positions. Covering every position on the image therefore yields 23*23 = 529 features.
Similarly, increasing the width of each rectangle by one pixel at a time (keeping the height at 2 pixels) and covering the complete image each time gives: no. of features = 23*(23+21+19+...+3+1).
Now, if we increase the height of each rectangle by 1 pixel at a time (sweeping the rectangle width from 1 to 12 pixels at every height) until the height reaches 24 pixels, we get:
no. of features = 24*(23+21+19+...+3+1) + 23*(23+21+19+...+3+1) + 22*(23+21+19+...+3+1) + ... + 2*(23+21+19+...+3+1) + 1*(23+21+19+...+3+1)
= (24+23+...+1) * (23+21+...+1)
= 300 * 144
= 43,200 features
Now consider the 2nd Viola-Jones original feature, which has two rectangles with one above the other (i.e. arranged vertically). Since this is just the 1st feature rotated, it also gives:
no. of features = 43,200
Similarly, following the above process for the 3rd original Viola-Jones feature, which has 3 rectangles arranged horizontally (so the rectangle width runs from 1 to 8 pixels), we get:
no. of features = 24*(22+19+16+...+4+1) + 23*(22+19+16+...+4+1) + 22*(22+19+16+...+4+1) + ... + 2*(22+19+16+...+4+1) + 1*(22+19+16+...+4+1)
= (24+23+...+1) * (22+19+...+1)
= 300 * 92
= 27,600
The feature with 3 rectangles arranged vertically (one upon another) is again just a rotation of the 3rd feature, so it likewise gives:
no. of features = 27,600
Lastly, for the 4th original Viola-Jones feature, which has 4 rectangles in a 2x2 arrangement (so both the width and the height of each rectangle run from 1 to 12 pixels), we get:
no. of features = 23*(23+21+19+...+3+1) + 21*(23+21+19+...+3+1) + 19*(23+21+19+...+3+1) + ... + 3*(23+21+19+...+3+1) + 1*(23+21+19+...+3+1)
= (23+21+...+1) * (23+21+...+1)
= 144 * 144
= 20,736
Summing up all of these: 43,200 + 43,200 + 27,600 + 27,600 + 20,736 = 162,336 features.
Thus, from these 162,336 features, AdaBoost selects a subset to form the strong classifier.
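For reference, the whole enumeration above can be checked with a short Python script (a sketch; the counting logic simply mirrors the derivation, with each feature type treated as an nx-by-ny grid of equal sub-rectangles):

    # Brute-force count of the BASIC Haar feature set on a 24x24 window.
    # A sub-rectangle of size w x h makes the whole template (nx*w) x (ny*h).

    def count_features(window, nx, ny):
        """Count all scaled/shifted placements of an nx-by-ny template."""
        total = 0
        for w in range(1, window // nx + 1):      # sub-rectangle width
            for h in range(1, window // ny + 1):  # sub-rectangle height
                # valid top-left positions at this scale
                total += (window - nx * w + 1) * (window - ny * h + 1)
        return total

    W = 24
    counts = {
        "two-rect horizontal":   count_features(W, 2, 1),  # 43,200
        "two-rect vertical":     count_features(W, 1, 2),  # 43,200
        "three-rect horizontal": count_features(W, 3, 1),  # 27,600
        "three-rect vertical":   count_features(W, 1, 3),  # 27,600
        "four-rect":             count_features(W, 2, 2),  # 20,736
    }
    for name, n in counts.items():
        print(name, n)
    print("total:", sum(counts.values()))          # 162,336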
I'd like to point to a recent work on this topic, published in February 2015:
César Cobos-May, Víctor Uc-Cetina, Carlos Brito-Loeza and Anabel Martin-Gonzalez, "A Convex Set Based Algorithm to Automatically Generate Haar-Like Features", Comput. Sci. Appl., Volume 2, Number 2, pp. 64-70, 2015.
It is presented as "a first step towards the automation of Haar-like features design".
An algorithm is proposed that should be able to "extract" a Haar-like feature given a segmented image representing the object that needs to be detected. The basic idea consists in trying to find the largest area rectangle entirely contained in a convex region.
First, a representative image of the object of interest is selected and processed with the K-means algorithm (with K=2). The resulting segmented image is then fed to the algorithm, which finds the largest rectangle for each (some?) of the convex regions in the segmented image to obtain the final feature.
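The paper's rectangle-finding algorithm itself isn't reproduced here, but the K=2 segmentation step is standard; a minimal OpenCV sketch of just that step might look like this (the input filename is a placeholder):

    import cv2
    import numpy as np

    # Cluster pixel intensities into two groups (object vs. background).
    # "object_sample.png" is a hypothetical input; which of the two
    # clusters corresponds to the object still has to be decided.
    img = cv2.imread("object_sample.png", cv2.IMREAD_GRAYSCALE)
    pixels = img.reshape(-1, 1).astype(np.float32)

    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, _ = cv2.kmeans(pixels, 2, None, criteria, 10,
                              cv2.KMEANS_RANDOM_CENTERS)

    # Binary mask: one cluster mapped to 255, the other to 0.
    segmented = (labels.reshape(img.shape) * 255).astype(np.uint8)
    cv2.imwrite("segmented.png", segmented)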
Actually, quoting the paper:
Later on [after the segmentation], our method to automatically create
the template [the Haar-like feature] was applied to the regions of the
segmented image.
So, if I'm not mistaken, there still has to be some degree of subjective decision about which regions of the segmented image are chosen as input to the algorithm.
Note that the features obtained in this way tend to be slightly more complex than the traditional ones.
This methodology is tested in a specific case study against the traditional features, giving seemingly good results.
It seems to me that there is a little bit of confusion here.
Even the accepted answer does not seem correct to me (though maybe I haven't understood it well).
The original Viola-Jones algorithm, its main later improvements such as the Lienhart-Maydt algorithm, and the OpenCV implementation all evaluate each and every feature of the feature set in turn.
You can check the source code of OpenCV (or whatever implementation you prefer).
At the end of the function void CvHaarEvaluator::generateFeatures() you have numFeatures, which is just 162,336 for BASIC mode and a 24x24 window size.
And all of them are checked in turn when the whole feature set is provided in the form of a featureEvaluator (source):
bool isStageTrained = tempStage->train( (CvFeatureEvaluator*)featureEvaluator, curNumSamples,
_precalcValBufSize, _precalcIdxBufSize, *((CvCascadeBoostParams*)stageParams) );
Every weak classifier is constructed by checking each feature and choosing the one that yields the best result at that point (in the case of a decision tree the process is similar).
After this choice, the sample weights are changed accordingly, so that in the next round a different feature, again drawn from the whole feature set, will be selected.
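As an illustration only (OpenCV itself trains boosted decision trees here, not these plain stumps), a minimal discrete-AdaBoost sketch of this "check every feature, pick the best, reweight" loop might look like:

    import numpy as np

    # Sketch of discrete AdaBoost with threshold stumps. Assumes
    # feature_values[i, j] holds feature j evaluated on sample i,
    # and labels[i] is +1/-1. Brute force, for illustration only.

    def adaboost_select(feature_values, labels, n_rounds):
        n_samples, n_features = feature_values.shape
        weights = np.full(n_samples, 1.0 / n_samples)
        chosen = []
        for _ in range(n_rounds):
            best = None
            # Check every feature in turn; keep the stump with the
            # lowest weighted error.
            for j in range(n_features):
                for thresh in np.unique(feature_values[:, j]):
                    for polarity in (1, -1):
                        pred = np.where(polarity * feature_values[:, j]
                                        < polarity * thresh, 1, -1)
                        err = weights[pred != labels].sum()
                        if best is None or err < best[0]:
                            best = (err, j, thresh, polarity)
            err, j, thresh, polarity = best
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
            pred = np.where(polarity * feature_values[:, j]
                            < polarity * thresh, 1, -1)
            # Reweight: misclassified samples gain weight next round.
            weights *= np.exp(-alpha * labels * pred)
            weights /= weights.sum()
            chosen.append((j, thresh, polarity, alpha))
        return chosen

The inner loop over all numFeatures columns is exactly the part that makes each boosting round expensive.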
A single feature evaluation is computationally cheap, but multiplied by numFeatures it can be demanding.
The whole training of a cascade can take weeks, but the bottleneck is not the feature evaluation process; it is the gathering of negative samples at the later stages.
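For reference, the reason a single feature evaluation is so cheap is the integral image: any rectangle sum costs four lookups regardless of its size. A minimal sketch (the two-rectangle feature layout and its position are just for illustration):

    import numpy as np

    def integral_image(img):
        """Summed-area table with a zero top row/left column for easy indexing."""
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
        ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
        return ii

    def rect_sum(ii, x, y, w, h):
        """Sum of pixels in the rectangle with top-left (x, y) and size w x h."""
        return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

    # A two-rectangle horizontal feature at (x, y), each half w x h pixels,
    # evaluated in O(1) as the difference of two rectangle sums.
    img = np.random.randint(0, 256, (24, 24))
    ii = integral_image(img)
    x, y, w, h = 4, 6, 3, 5
    value = rect_sum(ii, x + w, y, w, h) - rect_sum(ii, x, y, w, h)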
From the Wikipedia link you provided, I read:
in a standard 24x24 pixel sub-window, there are a total of M= 162,336
possible features, and it would be prohibitively expensive to evaluate
them all when testing an image.
Don't be misled by this: it means that after the long training process your detection algorithm will be very fast, since it only needs to check the few features selected during training.