I have been working for some time on a project to detect and track moving vehicles in video captured from UAVs. Currently I am using an SVM trained on bag-of-features representations of local features extracted from vehicle and background images. I then use a sliding-window detection approach to localise vehicles in the images, which I would then like to track. The problem is that this approach is far too slow, and my detector isn't as reliable as I would like, so I'm getting quite a few false positives.
So I have been considering segmenting the cars from the background to find their approximate positions, so as to reduce the search space before applying my classifier. I am not sure how to go about this, and was hoping someone could help.
Additionally, I have been reading about motion segmentation with layers, which uses optical flow to segment the frame by flow model. Does anyone have experience with this method? If so, could you offer some input as to whether you think it would be applicable to my problem?
Below are two frames from a sample video:
diff(x,y,k) = I(x,y,k) - I(x,y,k-1)
As your cars are moving, the regions where this frame difference is large give you their positions.
Assuming your cars are moving, you could try to estimate the ground plane (road).
You may get a decent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since a plane in 3D moves according to a homography between two camera frames.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
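To make the idea concrete, here is a minimal NumPy sketch of the RANSAC step. It assumes you already have matched feature coordinates as two N x 2 arrays (src in the first frame, dst in the second); feature extraction and matching are left out. The function names (fit_homography, project, ransac_homography) and the parameters n_iters, tol and seed are illustrative choices of mine, and the plain, unnormalised direct linear transform is just one simple way to fit a homography.

```python
import numpy as np


def fit_homography(src, dst):
    """Direct linear transform: find H with dst ~ H @ src in homogeneous coords."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    if abs(H[2, 2]) < 1e-12:  # degenerate sample, cannot normalise
        return None
    return H / H[2, 2]


def project(H, pts):
    """Apply H to an N x 2 array of points and return N x 2 pixel coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]


def ransac_homography(src, dst, n_iters=500, tol=2.0, seed=0):
    """Return (H, inlier_mask); tol is the reprojection error in pixels (assumed value)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=4, replace=False)  # minimal sample
        H = fit_homography(src[idx], dst[idx])
        if H is None:
            continue
        with np.errstate(divide="ignore", invalid="ignore"):
            err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best model for a more stable estimate.
    H = fit_homography(src[best_inliers], dst[best_inliers])
    return H, best_inliers
```

Matches flagged as outliers (`~inliers`) are the candidates for independently moving objects such as cars. This only works if the ground plane accounts for the majority of the matches, and tol would need tuning to your image resolution.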
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, and not that it is planar.
Update
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' ) = I( inv(H) [x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff = (Iw-K).*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H, such as moving cars, should stand out.
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but it's easy to implement for a first try, and it should be fast. For something fancier, look at Clustering in Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame it should be pretty fast.
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
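The warp, difference, threshold, clean-up and connected-components steps can be sketched in plain NumPy as follows. This is only an illustration under my own assumptions: grayscale frames as 2D arrays, nearest-neighbour sampling in the warp, a fixed threshold, a 3x3 morphological opening, and 4-connectivity. All function names are hypothetical; an image-processing library would normally do each step faster and with better interpolation.

```python
from collections import deque

import numpy as np


def warp_with_homography(img, H):
    """Inverse mapping Iw(p) = I(inv(H) p) with nearest-neighbour sampling."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x N homogeneous
    src = np.linalg.inv(H) @ pts
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    warped = np.zeros(h * w, dtype=float)
    warped[valid] = img[sy[valid], sx[valid]]
    return warped.reshape(h, w), valid.reshape(h, w)


def dilate(mask):
    """3x3 binary dilation implemented with shifted views of a padded copy."""
    h, w = mask.shape
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def erode(mask):
    """3x3 binary erosion via duality with dilation."""
    return ~dilate(~mask)


def bounding_boxes(mask):
    """Bounding boxes (x0, y0, x1, y1) of the 4-connected components of mask."""
    h, w = mask.shape
    seen = np.zeros_like(mask)
    boxes = []
    for y0, x0 in zip(*np.nonzero(mask)):
        if seen[y0, x0]:
            continue
        seen[y0, x0] = True
        queue = deque([(int(y0), int(x0))])
        ys, xs = [int(y0)], [int(x0)]
        while queue:  # breadth-first flood fill of one component
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
                    ys.append(ny)
                    xs.append(nx)
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def detect_moving_regions(I, K, H, thresh):
    """Squared difference after warping I onto K, thresholded and cleaned up."""
    Iw, valid = warp_with_homography(np.asarray(I, dtype=float), H)
    diff = (Iw - np.asarray(K, dtype=float)) ** 2
    diff[~valid] = 0.0  # pixels with no counterpart in I carry no information
    mask = dilate(erode(diff > thresh))  # opening removes isolated pixels
    return diff, mask, bounding_boxes(mask)
```

The returned boxes are candidate regions for moving objects. The threshold value is scene dependent, and the opening will also discard genuine detections that are only a pixel or two wide.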
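Thinking of the approach you currently use, it may be interesting to combine it with my proposal:
- You could try to train and run your classifier on the difference image D instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
- You could skip the expensive sliding-window search and run the classifier only on regions of D with sufficiently high value.
Some additional remarks: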