I am trying to implement the (relatively simple) linear homogeneous (DLT) 3D triangulation method from Hartley & Zisserman's "Multiple View Geometry" (sec 12.2), with the aim of implementing their full, "optimal algorithm" in the future. Right now, based on this question, I'm trying to get it to work in Matlab, and will later port it into C++ and OpenCV, testing for conformity along the way.
The problem is that I'm unsure how to use the data I have. I have calibrated my stereo rig, and obtained the two intrinsic camera matrices, two vectors of distortion coefficients, the rotation matrix and translation vector relating the two cameras, as well as the essential and fundamental matrices. I also have the 2D coordinates of two points that are supposed to be correspondences of a single 3D point in the coordinate systems of the two images (taken by the 1st and 2nd camera respectively).
The algorithm takes as input the two point coordinates and two 3x4 "camera matrices" P and P'. These obviously aren't the intrinsic camera matrices (M, M') obtained from the calibration: for one thing, those are 3x3, and also because projection using an intrinsic matrix alone puts a 3D point into each camera's own coordinate system - that is, the extrinsic (rotation/translation) data is missing.
The H&Z book contains information (chapter 9) on recovering the required matrices from either the fundamental or the essential matrix using SVD decomposition, but with additional problems of its own (e.g. scale ambiguity). I feel I don't need that, since I have the rotation and translation explicitly defined.
The question then is: would it be correct to use the first intrinsic matrix, with an extra column of zeros, as the first "camera matrix" (P=[M|0]), and then multiply the second intrinsic matrix by an extrinsic matrix composed of the rotation matrix and the translation vector as an extra column, to obtain the second required "camera matrix" (P'=M'*[R|t])? Or should it be done differently?
Thanks!
I don't have my H&Z to hand, but their old CVPR tutorial on the subject is here (for anyone else wanting to have a look w.r.t. this question).
Just for clarity (and to use their terminology), the projection matrix P maps a Euclidean 3-space point X to an image point x as:

    x = P * X

where:

    P = K * [R | t]

is defined by the (3x3) camera calibration matrix K, the (3x3) rotation matrix R and the (3x1) translation vector t.
The crux of the matter seems to be how to then perform triangulation using your two cameras P and P'.
I believe you are proposing that the world origin is located at the first camera P, thus:

    P = K * [I | 0]

and

    P' = K' * [R | t]
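As a minimal sketch of this construction in NumPy (easy to port to Matlab/OpenCV later) — all numeric values below are made-up placeholders standing in for your calibration output, not real data:

```python
import numpy as np

# Placeholder calibration data, for illustration only -- substitute your own.
K  = np.array([[800.0,   0.0, 320.0],
               [  0.0, 800.0, 240.0],
               [  0.0,   0.0,   1.0]])   # intrinsics of camera 1
K2 = K.copy()                            # intrinsics of camera 2 (K')
R  = np.eye(3)                           # rotation of camera 2 w.r.t. camera 1
t  = np.array([[-120.0], [0.0], [0.0]])  # translation of camera 2

P1 = K  @ np.hstack([np.eye(3), np.zeros((3, 1))])  # P  = K  [I | 0]
P2 = K2 @ np.hstack([R, t])                         # P' = K' [R | t]
```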
What we then seek for reconstruction is the fundamental matrix F, which relates corresponding image points x and x' via:

    x'^T * F * x = 0
The matrix F can of course be computed any number of ways (sometimes more commonly from uncalibrated images!), but here I think you might want to do it on the basis of your already-calibrated camera matrices above as:

    F = [P' * C]_x * P' * pinv(P)

where

    C = (0, 0, 0, 1)^T

is the centre of the first camera, pinv(P) is the pseudo-inverse of P, and [.]_x is the notation used in the literature for the skew-symmetric matrix that expresses the vector (cross) product as a matrix multiplication. You can then perform a factorization of the fundamental matrix F (via SVD or a direct method).
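A sketch of that computation in NumPy (the function names are mine; it assumes P was built as K [I | 0], so the centre of the first camera is simply (0, 0, 0, 1)^T):

```python
import numpy as np

def skew(v):
    """Return [v]_x, the matrix such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[  0.0, -v[2],  v[1]],
                     [ v[2],   0.0, -v[0]],
                     [-v[1],  v[0],   0.0]])

def fundamental_from_projections(P1, P2):
    """F = [P' C]_x * P' * pinv(P), with C the centre of the first camera."""
    C = np.array([0.0, 0.0, 0.0, 1.0])  # valid because P1 = K [I | 0]
    e2 = P2 @ C                         # epipole in the second image, e' = P' C
    return skew(e2) @ P2 @ np.linalg.pinv(P1)
```

For a general P (not of the form K [I | 0]) the centre C would instead be the null vector of P.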
And hence, as you correctly state, we can then compute the triangulation directly based on:

    P = K * [I | 0]

and

    P' = K' * [R | t]
Using these to perform triangulation should then be relatively straightforward (assuming good calibration, a lack of noise, etc.).
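For completeness, here is a sketch of the linear homogeneous (DLT) triangulation itself, along the lines of H&Z sec 12.2 — the function name is mine, and the points are expected as inhomogeneous (x, y) pixel coordinates:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Homogeneous DLT triangulation of one point correspondence.

    Builds A X = 0 from x cross (P X) = 0 for both views, then takes the
    right singular vector of A with the smallest singular value."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize
```

Note that in practice you would normalize the image coordinates first (H&Z's conditioning step) for numerical stability; that is skipped here for brevity.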