If linear interpolation happens during the rasterization stage in the OpenGL pipeline, and the vertices have already been transformed to screen-space, where does the depth information used for perspectively correct interpolation come from?
Can anybody give a detailed description of how OpenGL goes from screen-space primitives to fragments with correctly interpolated values?
The output of a vertex shader is a four component vector, vec4 gl_Position. From Section 13.6 Coordinate Transformations of the core GL 4.4 spec:

    Clip coordinates for a vertex result from shader execution, which yields
    a vertex coordinate gl_Position.

    Perspective division on clip coordinates yields normalized device
    coordinates, followed by a viewport transformation (see section 13.6.1)
    to convert these coordinates into window coordinates.
Consider a typical projection matrix with the left, top, right and bottom clipping planes at ±1 and the near plane at 1. It looks like:
    [ 1  0  0  0 ]
    [ 0  1  0  0 ]
P = [ *  *  *  * ]
    [ 0  0 -1  0 ]
We ignore the third row because it is used only for computing the depth value for the z-buffer, and is not relevant for the rest of the discussion.
Given a vertex (x, y, z, 1) in the world space, the vertex shader will pass the value

gl_Position = P * vec4(x, y, z, 1) = vec4(x, y, *, -z)

to the rasterization stage. This is where the perspective information comes from: it is in the last component of gl_Position! Note that it is not the responsibility of the vertex shader to do the perspective division; if you do it yourself, you will get incorrect interpolation.
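To make this concrete, here is a minimal C sketch (mine, not from any spec; the eye-space vertex values are made up) of what the vertex stage hands to the rasterizer for the matrix above. Note that there is deliberately no division by w:

#include <stdio.h>

typedef struct { float x, y, z, w; } vec4;

/* Apply the projection matrix P from above to an eye-space vertex.
 * The third (depth) row is ignored here, as in the text, so clip.z
 * stays 0. */
static vec4 project(vec4 v) {
    vec4 clip = { v.x, v.y, 0.0f, -v.z };  /* w_clip = -z_eye */
    return clip;
}

int main(void) {
    vec4 eye  = { 1.0f, 2.0f, -4.0f, 1.0f };   /* hypothetical vertex */
    vec4 clip = project(eye);                  /* what gl_Position holds */
    /* No clip.x /= clip.w here: the perspective division is the
     * rasterizer's job; doing it in the vertex shader breaks
     * interpolation. */
    printf("gl_Position = (%g, %g, %g, %g)\n", clip.x, clip.y, clip.z, clip.w);
    return 0;
}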
How is perspective-correct interpolation computed?
Let's consider the plane passing through the given triangle. We give it a parametrization (s, t):
x = s*x0 + t*x1 + x2
y = s*y0 + t*y1 + y2
z = s*z0 + t*z1 + z2
Let u = -x/z and v = -y/z be the device coordinates. Substitute the expressions for x, y, z and solve for (s, t). Then denote w = -1/z and eliminate z, s and t. If neither of us made a mistake, you should get
    (y0*z1 - z0*y1)*u + (z0*x1 - x0*z1)*v - (x0*y1 - y0*x1)
w = -------------------------------------------------------------- .
    (y0*z1 - z0*y1)*x2 + (z0*x1 - x0*z1)*y2 + (x0*y1 - y0*x1)*z2
We find that the inverse depth w is an affine function of the device coordinates (u, v). So we can compute w at the vertices of the triangle and interpolate it linearly in the interior.
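To see the affinity claim in action, here is a small numeric check in C (my own sketch, not part of the derivation; the plane coefficients are arbitrary): sample points on a plane, project them to device coordinates, and compare the true w = -1/z against the affine expression above.

#include <stdio.h>

/* Plane parametrization from the text: (x,y,z) = s*(x0,y0,z0) +
 * t*(x1,y1,z1) + (x2,y2,z2). The concrete numbers are made up. */
static const double x0 = 1, y0 = 0, z0 = -0.2;
static const double x1 = 0, y1 = 1, z1 = -0.1;
static const double x2 = 0, y2 = 0, z2 = -2.0;

int main(void) {
    /* coefficients of the affine expression for w */
    double cu  = y0*z1 - z0*y1;
    double cv  = z0*x1 - x0*z1;
    double c1  = x0*y1 - y0*x1;
    double den = cu*x2 + cv*y2 + c1*z2;

    for (double s = 0; s <= 1.01; s += 0.5)
        for (double t = 0; s + t <= 1.01; t += 0.5) {
            double x = s*x0 + t*x1 + x2;
            double y = s*y0 + t*y1 + y2;
            double z = s*z0 + t*z1 + z2;
            double u = -x/z, v = -y/z;   /* device coordinates */
            double w = -1.0/z;           /* true inverse depth */
            double w_affine = (cu*u + cv*v - c1) / den;
            printf("s=%.1f t=%.1f  w=%.6f  affine=%.6f\n", s, t, w, w_affine);
        }
    return 0;
}

The two columns printed should agree to machine precision.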
Next, we want to interpolate some arbitrary attribute.
It is clearly enough to compute the (s, t) parametrization for each fragment. Solving the equations above for (s, t) in terms of (x, y), we get:
    (x - x2)*y1 - x1*(y - y2)
s = -------------------------
         x0*y1 - x1*y0

    x0*(y - y2) - (x - x2)*y0
t = -------------------------
         x0*y1 - x1*y0
where (x, y) = (u, v)/w is immediate from the definitions. Thus we have expressed (s, t) as a function of (u, v), with one division per fragment and lots of muls and adds.
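Written out as C code, the per-fragment step might look like this (a sketch under the assumptions above; the function and struct names are mine). The 1/w is the single division per fragment; inv_det = 1/(x0*y1 - x1*y0) is constant per triangle and would be precomputed:

typedef struct { double s, t; } ST;

/* Per fragment: recover the plane coordinates (x, y) = (u, v)/w, then
 * solve the 2x2 system above for (s, t). */
static ST solve_st(double u, double v, double w,
                   double x0, double y0, double x1, double y1,
                   double x2, double y2, double inv_det) {
    double inv_w = 1.0 / w;              /* the one division per fragment */
    double x = u * inv_w, y = v * inv_w;
    ST r;
    r.s = ((x - x2)*y1 - x1*(y - y2)) * inv_det;
    r.t = (x0*(y - y2) - (x - x2)*y0) * inv_det;
    return r;
}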
Note that this is the theoretical part; real-life hardware implementations might use lots of approximations, tricks, and other magic.
Putting it all together
For simplicity, assume that viewport transformation is an identity, so that window coordinates coincide with the normalized device coordinates.
Let's consider the rasterization of a single triangle for which the vertex shader returned the positions P0, P1, P2 (these are their gl_Positions) and some attributes A0, A1, A2 which we have to interpolate correctly. Denote Qi = (Pi.x, Pi.y, -Pi.w).
Compute
(x0, y0, z0) = Q0 - Q2
(x1, y1, z1) = Q1 - Q2
(x2, y2, z2) = Q2.
(This is done once per triangle.)
For every vertex, compute
Ui = Pi.x / Pi.w
Vi = Pi.y / Pi.w
This gives its position on the screen.
Rasterize the 2D triangle with vertices (U0, V0), (U1, V1), (U2, V2), producing the set of fragment positions (u, v) which lie within the triangle.
For each fragment (u,v) produced by the above rasterization do the following:
Substitute (u,v) in the formulas in the previous section, giving (s,t) for the fragment.
- (s, t) vary linearly in the world space (with (0, 0) at P2, (1, 0) at P0, and (0, 1) at P1), and are perspective-correct in the screen space.
Calculate
A = s*(A0 - A2) + t*(A1 - A2) + A2
Execute the fragment shader with the input A (which is correctly interpolated).
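The whole recipe fits in a short C program. The sketch below is my own illustration of the steps above, under the stated assumptions (identity viewport, and a brute-force sample grid standing in for a real rasterizer); the triangle and attribute values are made up:

#include <stdio.h>

typedef struct { double x, y, z, w; } Vec4;

/* signed edge function for a crude point-in-triangle test */
static double edge(double ax, double ay, double bx, double by,
                   double px, double py) {
    return (bx - ax)*(py - ay) - (by - ay)*(px - ax);
}

int main(void) {
    /* hypothetical vertex shader outputs (w = -z_eye) and attributes */
    Vec4 P0 = { -0.5, -0.5, 0, 1.0 };
    Vec4 P1 = {  0.5, -0.5, 0, 2.0 };
    Vec4 P2 = {  0.0,  0.5, 0, 4.0 };
    double A0 = 0.0, A1 = 0.5, A2 = 1.0;

    /* once per triangle: Qi = (Pi.x, Pi.y, -Pi.w), then the basis */
    double x0 = P0.x - P2.x, y0 = P0.y - P2.y, z0 = P2.w - P0.w;
    double x1 = P1.x - P2.x, y1 = P1.y - P2.y, z1 = P2.w - P1.w;
    double x2 = P2.x,        y2 = P2.y,        z2 = -P2.w;

    double cu  = y0*z1 - z0*y1;          /* inverse-depth coefficients */
    double cv  = z0*x1 - x0*z1;
    double c1  = x0*y1 - y0*x1;
    double den = cu*x2 + cv*y2 + c1*z2;
    double inv_det = 1.0 / (x0*y1 - x1*y0);

    /* once per vertex: screen positions */
    double U0 = P0.x / P0.w, V0 = P0.y / P0.w;
    double U1 = P1.x / P1.w, V1 = P1.y / P1.w;
    double U2 = P2.x / P2.w, V2 = P2.y / P2.w;

    for (double v = -1.0; v <= 1.0; v += 0.25)
        for (double u = -1.0; u <= 1.0; u += 0.25) {
            /* coverage test standing in for real rasterization */
            double e0 = edge(U0, V0, U1, V1, u, v);
            double e1 = edge(U1, V1, U2, V2, u, v);
            double e2 = edge(U2, V2, U0, V0, u, v);
            if (!((e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                  (e0 <= 0 && e1 <= 0 && e2 <= 0)))
                continue;
            /* per fragment: inverse depth, then (s, t), then A */
            double w = (cu*u + cv*v - c1) / den;
            double x = u / w, y = v / w;
            double s = ((x - x2)*y1 - x1*(y - y2)) * inv_det;
            double t = (x0*(y - y2) - (x - x2)*y0) * inv_det;
            double A = s*(A0 - A2) + t*(A1 - A2) + A2;
            printf("fragment (%5.2f, %5.2f): A = %.3f\n", u, v, A);
        }
    return 0;
}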
The formula that you will find in the GL specification (look on page 427 of the current 4.4 spec, but it has always been that way) for perspective-correct interpolation of the attribute value in a triangle is:
    a * f_a / w_a + b * f_b / w_b + c * f_c / w_c
f = ----------------------------------------------
            a / w_a + b / w_b + c / w_c
where a, b, c denote the barycentric coordinates of the point in the triangle we are interpolating for (a, b, c >= 0, a + b + c = 1), f_i the attribute value at vertex i, and w_i the clip-space w coordinate of vertex i. Note that the barycentric coordinates are calculated only for the 2D projection of the window-space coordinates of the triangle (so z is ignored).
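Transliterated into C (my sketch; the function name is made up), the formula is just:

/* Perspective-correct interpolation of one attribute, straight from the
 * formula above: (a, b, c) are the 2D barycentric coordinates of the
 * fragment, f_* the attribute values and w_* the clip-space w of the
 * three vertices. */
static double interp_attr(double a, double b, double c,
                          double f_a, double f_b, double f_c,
                          double w_a, double w_b, double w_c) {
    return (a*f_a/w_a + b*f_b/w_b + c*f_c/w_c)
         / (a/w_a + b/w_b + c/w_c);
}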
This is what the formulas that ybungalowbill gave in his fine answer boil down to, in the general case, with an arbitrary projection axis. Actually, the last row of the projection matrix defines just the projection axis the image plane will be orthogonal to, and the clip-space w component is simply the dot product between the vertex coordinates and that axis.

In the typical case, the projection matrix has (0, 0, -1, 0) as the last row, so it transforms such that w_clip = -z_eye, and this is what ybungalowbill used. However, since w is what we actually divide by (that is the only nonlinear step in the whole transformation chain), this will work for any projection axis. It will also work in the trivial case of orthogonal projections, where w is always 1 (or at least constant).
Note a few things for an efficient implementation of this. The inversion 1/w_i can be pre-calculated per vertex (let's call the results q_i in the following); it does not have to be re-evaluated per fragment. And it is totally free, since we divide by w anyway when going into NDC space, so we can save that value. The GL spec never describes how a certain feature is to be implemented internally, but the fact that the screen-space coordinates will be accessible in gl_FragCoord.xyz, and that gl_FragCoord.w is guaranteed to give the (linearly interpolated) 1/w clip-space coordinate, is quite revealing here. That per-fragment 1/w value is actually the denominator of the formula given above.
The factors a/w_a, b/w_b and c/w_c are each used twice in the formula. And these are also constant for any attribute value, no matter how many attributes there are to be interpolated. So, per fragment, you can calculate a' = q_a * a, b' = q_b * b and c' = q_c * c once, and get
    a' * f_a + b' * f_b + c' * f_c
f = ------------------------------
            a' + b' + c'
So the perspective interpolation boils down to
- 3 additional multiplications,
- 2 additional additions, and
- 1 additional division
per fragment.
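In code, the optimized evaluation could look like this (my illustration; the helper names are made up). The q_i are computed once per vertex, the weights a', b', c' once per fragment, and every additional attribute reuses them:

typedef struct { double ap, bp, cp; } Weights;   /* a', b', c' */

/* once per fragment: fold the pre-computed q_i = 1/w_i into the
 * barycentric coordinates */
static Weights fold(double a, double b, double c,
                    double q_a, double q_b, double q_c) {
    Weights w = { a*q_a, b*q_b, c*q_c };
    return w;
}

/* once per attribute: 3 muls and 2 adds for the numerator; the division
 * by (a' + b' + c') can itself be hoisted into one reciprocal per
 * fragment and shared by all attributes */
static double interp(Weights w, double f_a, double f_b, double f_c) {
    return (w.ap*f_a + w.bp*f_b + w.cp*f_c) / (w.ap + w.bp + w.cp);
}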