GLSL per-pixel spinlock using imageAtomicCompSwap

Posted 2020-01-29 03:08

Question:

Example 11.13 in the OpenGL Red Book, 9th edition (OpenGL 4.5), is "Simple Per-Pixel Mutex". It uses imageAtomicCompSwap in a do {} while() loop to take a per-pixel lock, preventing simultaneous access to a shared resource by pixel shader (PS) invocations that correspond to the same pixel coordinate.

layout (binding = 0, r32ui) uniform volatile coherent uimage2D lock_image;

void main(void)
{
    ivec2 pos = ivec2(gl_FragCoord.xy);

    // spinlock - acquire
    uint lock_available;
    do {
        lock_available = imageAtomicCompSwap(lock_image, pos, 0, 1);
    } while (lock_available != 0);

    // do some operations protected by the lock
    do_something();

    // spinlock - release
    imageStore(lock_image, pos, uvec4(0));
}

This example results in an APPCRASH on both Nvidia and AMD GPUs. I know that on these two platforms PS invocations are unable to progress independently of each other: a sub-group of threads is executed in lockstep, sharing the control flow (a "warp" of 32 threads in Nvidia's terminology). So the loop may deadlock.

However, nowhere does the OpenGL spec mention "threads executed in lockstep". It only says "The relative order of invocations of the same shader type are undefined." As in this example, why can we not use atomic operation imageAtomicCompSwap to ensure exclusive access between different PS invocations? Does this mean Nvidia and AMD GPUs do not conform to the OpenGL spec?

Answer 1:

As in this example, why can we not use atomic operation imageAtomicCompSwap to ensure exclusive access between different PS invocations?

If you are using atomic operations to lock access to a pixel, you are relying on one aspect of relative order: that all threads will eventually make forward progress. That is, you assume that a thread spinning on a lock will not starve the thread that holds the lock of its execution resources, and that the thread holding the lock will eventually make forward progress and release it.

But since the relative order of execution is undefined, there is no guarantee of any of that. And therefore, your code cannot work. Any code which relies on any aspect of ordering between the invocations of a single shader stage cannot work (unless there are specific guarantees in place).

This is precisely why ARB_fragment_shader_interlock exists.
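
For illustration, here is a minimal sketch of what that extension looks like in a fragment shader, assuming the driver exposes GL_ARB_fragment_shader_interlock. The image name counter_image and the read-modify-write inside the critical section are hypothetical stand-ins for whatever do_something() protects in the question; they are not taken from the Red Book example.

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

// Overlapping fragments at the same pixel enter the critical section
// one at a time, in primitive order. pixel_interlock_unordered drops
// the ordering requirement but keeps the mutual exclusion.
layout(pixel_interlock_ordered) in;

// Hypothetical per-pixel resource (assumption, not from the original post).
layout(binding = 0, r32ui) uniform coherent uimage2D counter_image;

out vec4 frag_color;

void main(void)
{
    ivec2 pos = ivec2(gl_FragCoord.xy);

    // Both calls must appear in main(), outside any flow control,
    // and each must be executed exactly once.
    beginInvocationInterlockARB();

    // Critical section: a plain (non-atomic) read-modify-write is safe
    // here because the implementation serializes overlapping fragments.
    uint n = imageLoad(counter_image, pos).x;
    imageStore(counter_image, pos, uvec4(n + 1u));

    endInvocationInterlockARB();

    frag_color = vec4(1.0);
}
```

There is no spin loop here at all: the implementation, not the shader, decides how overlapping fragments are scheduled, so the shader never depends on forward progress of another invocation.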


That being said, even if there were guarantees of forward progress, your code would still be broken.

You use a non-atomic operation (imageStore) to release the lock. You should be using an atomic set operation, such as imageAtomicExchange.

Plus, as others have pointed out, you need to continue to spin if the return value from the atomic compare/swap is not zero. Remember: all atomic functions return the original value from the image. So if the original value it atomically read is not 0, then it compared false and you don't have the lock.

Even with those fixes, your code will still be undefined behavior by the spec. But it's more likely to work.



Answer 2:

However, nowhere does the OpenGL spec mention "threads executed in lockstep". It only says "The relative order of invocations of the same shader type are undefined."

You say this as if the wording of the GL spec did not cover the "lockstep" situation. But "The relative order of invocations of the same shader type are undefined." actually does cover it. Given two shader invocations A and B, this statement means that you must not assume any of the following:

  • that A is executed before B
  • that B is executed before A
  • that A and B are executed in parallel
  • that A and B are not executed in parallel
  • that parts of A are executed before the same or other parts of B
  • that parts of B are executed before the same or other parts of A
  • that parts of A and B are executed in parallel
  • that parts of A and B are not executed in parallel
  • ... (probably a lot more) ...

The undefined order means you can never wait on the results of another invocation, because there is no guarantee that the other invocation gets to execute and produce that result before the wait, except in situations where the GL spec makes certain extra guarantees, e.g.:

  • when using explicit synchronization mechanisms like barrier()
  • there are some weak ordering guarantees between different shader stages (e.g. you may assume that all vertex shader invocations for a primitive have already happened when a fragment of that very primitive is processed)
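
As a sketch of the first bullet: barrier() is only available in tessellation control and compute shaders, not in fragment shaders, so it cannot rescue the per-pixel lock from the question. The compute shader below shows the kind of wait it does make legal. The buffer names (SrcBuffer, DstBuffer), the tile size of 64, and the assumption that the dispatch covers exactly the buffer length are all hypothetical.

```glsl
#version 430 core

layout(local_size_x = 64) in;

// Hypothetical input/output buffers (assumed to hold one float per invocation).
layout(std430, binding = 0) readonly buffer SrcBuffer { float src_data[]; };
layout(std430, binding = 1) writeonly buffer DstBuffer { float dst_data[]; };

// One slot per invocation in the work group.
shared float tile[64];

void main(void)
{
    uint lid = gl_LocalInvocationID.x;
    uint gid = gl_GlobalInvocationID.x;

    // Each invocation publishes its own value.
    tile[lid] = src_data[gid];

    // Make the shared-memory write visible, then wait until every
    // invocation of this work group has reached this point.
    memoryBarrierShared();
    barrier();

    // Only now is it safe to read a value written by another invocation.
    uint neighbor = (lid + 1u) % 64u;
    dst_data[gid] = tile[lid] + tile[neighbor];
}
```

Without the barrier() call, reading tile[neighbor] would depend on the relative execution order of two invocations, which is exactly what the spec leaves undefined.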

For example, the GLSL Spec, Version 4.60 explains the concept of "invocation groups" in section 8.18:

Implementations of the OpenGL Shading Language may optionally group multiple shader invocations for a single shader stage into a single SIMD invocation group, where invocations are assigned to groups in an undefined implementation-dependent manner.

and the accompanying GL 4.6 core profile spec defines "invocation groups" in section 7.9 as

An invocation group [...] for a compute shader is the set of invocations in a single work group. For graphics shaders, an invocation group is an implementation-dependent subset of the set of shader invocations of a given shader stage which are produced by a single drawing command. For MultiDraw* commands with drawcount greater than one, invocations from separate draws are in distinct invocation groups.

So except for compute shaders, the GL gives you only draw-call granularity over the invocation groups. This section of the spec also has the following footnote to make this absolutely clear:

Because the partitioning of invocations into invocation groups is implementation-dependent and not observable, applications generally need to assume the worst case of all invocations in a draw belong to a single invocation group.

So besides that stronger statement about undefined relative invocation order, the spec also covers "in-lockstep" SIMD processing, and makes it very clear that you do not have much control over it in the graphics pipeline.



Answer 3:

If the execution order is the problem, reordering the code a bit might solve the problem:

layout (binding = 0, r32ui) uniform volatile coherent uimage2D lock_image;

void main(void)
{
    ivec2 pos = ivec2(gl_FragCoord.xy);

    // spinlock - acquire
    uint lock_available;
    do {
        lock_available = imageAtomicCompSwap(lock_image, pos, 0, 1);

        if (lock_available == 0)
        {
            // do some operations protected by the lock
            do_something();

            // spinlock - release
            imageAtomicExchange(lock_image, pos, 0);
        }

    } while (lock_available != 0);
}