Probabilistic Occupancy Map (POM) is a multi-camera human detection method, with its C++ implementation freely available at:
http://cvlab.epfl.ch/software/pom
In order to use this handy piece of software, one needs:
- A series of synchronized video frames from multiple cameras, after a background subtraction procedure.
- A configuration file for a particular scenario.
POM ships with an example set of video frames and related configuration file.
My problem can be stated as follows:
Given a sequence of synchronized videos (for example, from http://cvlab.epfl.ch/data/pom), how do I generate the configuration file required by POM? In particular, I'm interested in the RECTANGLE tag of the configuration. The readme states:
RECTANGLE [camera number] [location number] notvisible|[xmin] [ymin] [xmax] [ymax]
Defines the parameters of a certain rectangle, standing for an
individual at a certain location viewed from a certain camera. All
non-specified rectangles are "not visible" by default.
From my understanding, it defines what a person's bounding rectangle would look like at a certain location viewed from a certain camera. This has to be defined for every (grid) location for every camera (given the location is in the camera's field of view; if not, notvisible is used or the rectangle may be left undefined).
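For illustration, following the readme format above, a few entries might look like this (the numbers here are made up):

    RECTANGLE 0 0 100 50 150 200
    RECTANGLE 0 1 120 55 168 210
    RECTANGLE 1 0 notvisible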
Doing this by hand is not impossible, but it is certainly impractical. So, to sum up: how do I generate the POM configuration file if I have a set of videos from multiple cameras?
In the associated publication, the authors mention that they use the camera calibration to generate the rectangles for a human silhouette at every position in the grid. The code that accomplishes this does not seem to be included in the source files, so you will have to write it yourself.
In the calibration data for their datasets, you can see that they make use of two homographies per camera: the head-plane homography and the ground-plane homography. You can use these to quickly obtain the required rectangles.
The head-plane homography is a 3x3 matrix that describes a mapping from one plane to another; in this case, from 2D room coordinates (at head level) to 2D image coordinates. You can determine this homography for your own camera with the findHomography function in OpenCV. All you need to do is measure the coordinates of four points on the ground in the room and stand an upright pole on those markings. The pole should be as long as the average person you want to track is tall. Now write a small program that lets you click on the top of the pole in each camera view. You then have four world points (the coordinates measured in the room) and four image points per camera (the points you clicked), and findHomography gives you the homography. Do the same for the markings on the ground without the pole, and you have the two homographies per camera.
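As a minimal sketch of that calibration step in C++ with OpenCV, where every coordinate below is a made-up placeholder for your own measurements and clicks:

    // Estimating the ground- and head-plane homographies for one camera.
    // Replace the world points with your room measurements and the image
    // points with the pixels you clicked.
    #include <opencv2/calib3d.hpp>
    #include <opencv2/core.hpp>
    #include <vector>

    int main() {
        // Four marked points on the room floor, in room coordinates (cm).
        std::vector<cv::Point2f> worldPts = {
            {0, 0}, {400, 0}, {400, 300}, {0, 300}};

        // Where those floor markings appear in this camera's image.
        std::vector<cv::Point2f> groundImagePts = {
            {102, 355}, {540, 362}, {610, 120}, {80, 110}};

        // Where the top of the person-length pole appears when it is
        // stood upright on each of the four markings.
        std::vector<cv::Point2f> headImagePts = {
            {100, 210}, {545, 220}, {612, 40}, {78, 35}};

        // 3x3 homographies mapping room coordinates to image coordinates.
        cv::Mat groundH = cv::findHomography(worldPts, groundImagePts);
        cv::Mat headH   = cv::findHomography(worldPts, headImagePts);
        return 0;
    }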
You can now use the homographies to project the 8 corner points of a box standing at any position in the room onto their image coordinates for each camera: the 4 bottom corners go through the ground-plane homography, and the 4 top corners through the head-plane homography. Take the bounding box of all 8 projected points and you have the rectangle for that room location and that camera.
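A sketch of that projection step, assuming the two homographies from the previous step and a 50 cm body width; rectangleAt is a hypothetical helper name:

    // Feet corners go through the ground-plane homography, head corners
    // through the head-plane homography; the POM rectangle is the
    // bounding box of all 8 projected points.
    #include <algorithm>
    #include <opencv2/core.hpp>
    #include <vector>

    cv::Rect rectangleAt(float x, float y,
                         const cv::Mat& groundH, const cv::Mat& headH) {
        const float half = 25.0f;  // half of the assumed 50 cm body width
        std::vector<cv::Point2f> base = {
            {x - half, y - half}, {x + half, y - half},
            {x + half, y + half}, {x - half, y + half}};

        std::vector<cv::Point2f> feet, head;
        cv::perspectiveTransform(base, feet, groundH);
        cv::perspectiveTransform(base, head, headH);

        // Bounding box over all 8 projected corners.
        std::vector<cv::Point2f> all(feet);
        all.insert(all.end(), head.begin(), head.end());
        float xmin = all[0].x, xmax = all[0].x;
        float ymin = all[0].y, ymax = all[0].y;
        for (const cv::Point2f& p : all) {
            xmin = std::min(xmin, p.x); xmax = std::max(xmax, p.x);
            ymin = std::min(ymin, p.y); ymax = std::max(ymax, p.y);
        }
        return cv::Rect(cv::Point(cvRound(xmin), cvRound(ymin)),
                        cv::Point(cvRound(xmax), cvRound(ymax)));
    }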
The authors of the method mentioned using a human silhouette, which suggests their approach may be more accurate than using a cuboid. However, there is no such thing as the one silhouette of a moving person, so the cuboid solution is likely to be perfectly workable.
I've recently been reading this article and digging through the code, and what I understood from the article and code is pretty much what has been discussed here.
To sum up: for every camera in the system, you have to create rectangles for every possible grid position, which POM will later compare against the real silhouettes obtained from the background subtraction algorithm (assuming you've already obtained those). Since a camera may not see every grid position in the scene, you put the "notvisible" tag in those cases. As already mentioned, you need to use the calibration files to map a body of 175 cm height and 50 cm width into each view according to the perspective, i.e., closer rectangles are supposed to be bigger than farther ones.
RECTANGLE 0 414 150 0 159 119 means: camera 0 would see a person standing at grid location 414 as a rectangle spanning from P1(x, y) = (150, 0) to P2(x, y) = (159, 119) in image coordinates. These values are obtained by projecting the 175 cm by 50 cm body model into the image using the head-plane and ground-plane homographies.
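To tie it together, here is a sketch of how the RECTANGLE lines could be emitted, assuming the rectangleAt helper sketched above and row-major location numbering; the grid, cell, and image sizes are placeholders you would adapt to your scene:

    // Writing the RECTANGLE section of a POM configuration file.
    #include <fstream>
    #include <vector>
    #include <opencv2/core.hpp>

    void writeRectangles(std::ofstream& cfg,
                         const std::vector<cv::Mat>& groundH,
                         const std::vector<cv::Mat>& headH) {
        const int gridW = 30, gridH = 30;      // grid cells (placeholder)
        const float cellCm = 25.0f;            // cell size in cm (placeholder)
        const cv::Rect image(0, 0, 360, 288);  // image size (placeholder)

        for (size_t cam = 0; cam < groundH.size(); ++cam) {
            for (int gy = 0; gy < gridH; ++gy) {
                for (int gx = 0; gx < gridW; ++gx) {
                    int loc = gy * gridW + gx;       // POM location number
                    float x = (gx + 0.5f) * cellCm;  // cell centre, room coords
                    float y = (gy + 0.5f) * cellCm;
                    cv::Rect r = rectangleAt(x, y, groundH[cam], headH[cam]);
                    // Simplifying assumption: a location whose projected
                    // rectangle misses the image entirely is not visible.
                    if ((r & image).area() == 0)
                        cfg << "RECTANGLE " << cam << " " << loc
                            << " notvisible\n";
                    else
                        cfg << "RECTANGLE " << cam << " " << loc << " "
                            << r.x << " " << r.y << " "
                            << r.x + r.width << " " << r.y + r.height << "\n";
                }
            }
        }
    }

Whether you write notvisible explicitly or simply omit those lines is up to you, since all non-specified rectangles are "not visible" by default.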
UPDATE: I tried what I posted here and yeah, it works.