There are many posts about 3D reconstruction from stereo views of known internal calibration, some of which are excellent. I have read a lot of them, and based on what I have read I am trying to compute my own 3D scene reconstruction with the below pipeline / algorithm. I'll set out the method then ask specific questions at the bottom.
0. Calibrate your cameras:
- This means retrieving the camera calibration matrices K1 and K2 for Camera 1 and Camera 2. These are 3x3 matrices encapsulating each camera's internal parameters: focal length and principal point offset / image centre. These don't change, so you should only need to do this once per camera, as long as you don't zoom or change the resolution you record in.
- Do this offline. Do not argue.
- I'm using OpenCV's `calibrateCamera()` and checkerboard routines, but this functionality is also included in the Matlab Camera Calibration Toolbox. The OpenCV routines seem to work nicely; see the sketch below.
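For concreteness, here's a minimal sketch of what that offline step can look like with OpenCV's Python bindings. The board size and the `image_paths` list are placeholder assumptions for your own setup:

```python
import cv2
import numpy as np

pattern_size = (9, 6)  # inner corners per row/column -- an assumed board
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in image_paths:  # hypothetical list of calibration image files
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(objp)  # board corners in units of board squares
        img_points.append(corners)

# K is the 3x3 calibration matrix; repeat per camera
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```

Run this once per camera and store K1 and K2 (and the distortion coefficients, which you will want for undistorting points later).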
1. Fundamental Matrix F:
- With your cameras now set up as a stereo rig, determine the fundamental matrix (3x3) of that configuration using point correspondences between the two images/views.
- How you obtain the correspondences is up to you and will depend a lot on the scene itself.
- I am using OpenCV's `findFundamentalMat()` to get F, which provides a number of options method-wise (8-point algorithm, RANSAC, LMedS).
- You can test the resulting matrix by plugging it into the defining equation of the fundamental matrix: `x'Fx = 0`, where x' and x are the raw image point correspondences (x, y) in homogeneous coordinates (x, y, 1), and one of the three-vectors is transposed so that the multiplication makes sense. The nearer to zero for each correspondence, the better F is obeying its relation. This is equivalent to checking how well the derived F actually maps from one image plane to another. I get an average deflection of ~2px using the 8-point algorithm; see the sketch below.
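As a sketch of that residual test, assuming `pts1` and `pts2` are Nx2 NumPy arrays of matched pixel coordinates (`pts1` holding x and `pts2` holding x'):

```python
import cv2
import numpy as np

F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)

# Homogenise: (x, y) -> (x, y, 1)
h1 = np.hstack([pts1, np.ones((len(pts1), 1))])
h2 = np.hstack([pts2, np.ones((len(pts2), 1))])

# Algebraic residual x'.t * F * x per correspondence; ideally zero
residuals = np.abs(np.sum((h2 @ F) * h1, axis=1))
print("mean residual:", residuals.mean())
```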
2. Essential Matrix E:
- Compute the Essential matrix directly from F and the calibration matrices.
- `E = K2.t * F * K1`
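In code this is a single line; the only thing to get right is the order of the calibration matrices (K2 transposed on the left, K1 on the right), assuming K1, K2 and F are NumPy arrays from the previous steps:

```python
E = K2.T @ F @ K1
```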
3. Internal Constraint upon E:
- E should obey certain constraints. In particular, if decomposed by SVD into `U*S*V.t`, then its singular values should be (a, a, 0): the first two diagonal elements of S should be equal, and the third zero.
- I was surprised to read here that if this is not true when you test for it, you might choose to fabricate a new Essential matrix from the prior decomposition like so: `E_new = U * diag(1,1,0) * V.t`, which is of course guaranteed to obey the constraint. You have essentially set `S = diag(1,1,0)` artificially.
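A sketch of that enforcement step in NumPy:

```python
import numpy as np

# Decompose E, then rebuild it with the singular values forced to (1, 1, 0)
U, S, Vt = np.linalg.svd(E)
E_new = U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```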
4. Full Camera Projection Matrices:
- There are two camera projection matrices, P1 and P2. These are 3x4 and obey the `x = PX` relation. Also, `P = K[R|t]` and therefore `K_inv * P = [R|t]` (where the camera calibration has been removed).
- The first matrix P1 (excluding the calibration matrix K) can be set to `[I|0]`; then P2 (excluding K) is `[R|t]`.
- Compute the rotation and translation between the two cameras, R and t, from the decomposition of E. There are two possible ways to calculate R (`U*W*V.t` and `U*W.t*V.t`) and two ways to calculate t (± the third column of U), which means there are four combinations of `[R|t]`, only one of which is valid.
- Compute all four combinations, and choose the one that geometrically corresponds to the situation where a reconstructed point is in front of both cameras. I actually do this by carrying through and calculating the resulting P2 = [R|t] and triangulating the 3D position of a few correspondences in normalised coordinates to ensure that they have a positive depth (z-coordinate); see the sketch below.
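Here is a sketch of generating the four candidates, assuming E already satisfies the internal constraint; W is the fixed matrix from Hartley and Zisserman:

```python
import numpy as np

U, _, Vt = np.linalg.svd(E)
if np.linalg.det(U @ Vt) < 0:  # keep the rotations proper (det(R) = +1)
    Vt = -Vt

W = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])

R1 = U @ W @ Vt
R2 = U @ W.T @ Vt
t = U[:, 2].reshape(3, 1)  # third column of U, up to sign

candidates = [(R1, t), (R1, -t), (R2, t), (R2, -t)]
# For each candidate build P2 = [R|t], triangulate a few normalised
# correspondences, and keep the one giving positive depth in both views.
```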
5. Triangulate in 3D:
- Finally, combine the recovered 3x4 projection matrices with their respective calibration matrices: P'1 = K1P1 and P'2 = K2P2
- And triangulate the 3-space coordinates of each 2D point correspondence accordingly, for which I am using the LinearLS method from here (sketched below).
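For reference, here is a sketch of a linear least-squares triangulation of one correspondence. This is the homogeneous (DLT) variant, which may differ in detail from the LinearLS method linked above; `P1` and `P2` are the full 3x4 projection matrices with K folded in, and `x1`, `x2` are the matched (x, y) pixel coordinates:

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    # Each image point contributes two rows to the homogeneous system AX = 0
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # X is the right singular vector for the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenise to (X, Y, Z)
```

OpenCV's `triangulatePoints()` does the same job in batch, returning 4xN homogeneous coordinates.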
QUESTIONS:
- Are there any howling omissions and/or errors in this method?
- My F matrix is apparently accurate (0.22% deflection in the mapping compared to typical coordinate values), but when testing E against `x'Ex = 0` using normalised image correspondences, the typical error in that mapping is >100% of the normalised coordinates themselves. Is testing E against `x'Ex = 0` valid, and if so, where is that jump in error coming from?
- The error in my fundamental matrix estimation is significantly worse when using RANSAC than the 8-point algorithm, ±50px in the mapping between x and x'. This deeply concerns me.
- 'Enforcing the internal constraint' still sits very weirdly with me - how can it be valid to just manufacture a new Essential matrix from part of the decomposition of the original?
- Is there a more efficient way of determining which combo of R and t to use than calculating P and triangulating some of the normalised coordinates?
- My final re-projection error is hundreds of pixels in 720p images. Am I likely looking at problems in the calibration, determination of P-matrices or the triangulation?
ANSWER:
Using the 8-point algorithm does not exclude using the RANSAC principle. When using the 8-point algorithm directly, which points do you use? You have to choose 8 (good) points by yourself.
In theory you can compute a fundamental matrix from any point correspondences, and you will often get a degenerate fundamental matrix because the linear equations are not independent. Another point is that the 8-point algorithm uses an overdetermined system of linear equations, so a single outlier will destroy the fundamental matrix.
Have you tried to use the RANSAC result? I bet it represents one of the correct solutions for F.
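One way to combine the two, as suggested above: let RANSAC select the inlier set, then re-estimate F from only those inliers with the (least-squares) 8-point algorithm. The threshold and confidence values here are illustrative assumptions:

```python
import cv2
import numpy as np

# pts1, pts2: Nx2 arrays of matched pixel coordinates
F_ransac, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
inliers = mask.ravel() == 1

# Refit on the inliers only; FM_8POINT solves the overdetermined system
F_refined, _ = cv2.findFundamentalMat(pts1[inliers], pts2[inliers], cv2.FM_8POINT)
```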
Again, if F is degenerate, x'Fx = 0 can hold for every point correspondence.
Another reason for your incorrect E may be a switch of the cameras (K1.t * F * K2 instead of K2.t * F * K1). Remember to check: x'Ex = 0.
It is explained in 'Multiple View Geometry in Computer Vision' by Hartley and Zisserman. As far as I know, it has to do with the minimization of the Frobenius norm of F.
You can Google it; there are PDF resources.
No, not as far as I know.
Your rigid body transformation P2 is incorrect because E is incorrect.