Threadsafe lazy initialization: static vs std::cal

For threadsafe lazy initialization, should one prefer a static variable inside a function, std::call_once, or explicit double checked locking? Are there any meaningful differences?

All three can be seen in this question.

Double-Checked Lock Singleton in C++11

Two versions of double checked locking in C++11 turn up in Google.

Anthony Williams shows both double checked locking with explicit memory ordering and std::call_once. He doesn't mention static but that article might have been written before C++11 compilers were available.

Jeff Preshing, in an extensive writeup, describes several variations of double checked locking. He does mention using a static variable as an option and he even shows that compilers will generate code for double checked locking to initialize a static variable. It's not clear to me if he concludes that one way is better than the other.

I get the sense that both articles are meant to be pedagogical and that there's no reason to write double checked locking by hand. The compiler will do it for you if you use a static variable or std::call_once.
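For concreteness, here is a minimal sketch of the two "let the implementation do it" options side by side. Widget, its constructions counter, and the function names are made up for illustration; only the function-local static and std::call_once / std::once_flag come from the standard.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical type; the counter records how often the constructor runs.
struct Widget {
    static std::atomic<int> constructions;
    Widget() { constructions.fetch_add(1); }
};
std::atomic<int> Widget::constructions{0};

// Option 1: function-local static. Since C++11 the initialization runs
// exactly once, even if several threads reach this line concurrently.
Widget& widget_via_static() {
    static Widget instance;
    return instance;
}

// Option 2: std::call_once with an explicit flag.
Widget& widget_via_call_once() {
    static std::once_flag flag;
    static Widget* instance = nullptr;
    // The object is intentionally never deleted in this sketch.
    std::call_once(flag, [] { instance = new Widget; });
    return *instance;
}

// Calls both accessors from several threads and returns how many
// Widget constructors ran in total (one per accessor is expected).
int run_demo(int thread_count) {
    assert(thread_count > 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; ++i) {
        threads.emplace_back([] {
            widget_via_static();
            widget_via_call_once();
        });
    }
    for (auto& t : threads) t.join();
    return Widget::constructions.load();
}
```

However many threads run_demo starts, each accessor should construct its Widget only once. Depending on the toolchain, building this may require linking the threading library (for example -pthread with GCC).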

GCC uses platform-specific tricks to avoid atomic operations entirely on the fast path of a static's initialization, exploiting the fact that it can reason about a static far better than it can about call_once or hand-written double checked locking.

Because double checked locking uses atomics as its method of avoiding race conditions, it has to pay the price of an acquire every time. It's not a high price, but it's a price.

It has to pay this because atomics have to remain atomic in all cases, even for difficult operations like compare-exchange. This makes the acquire very hard to optimize out. Generally speaking, the compiler has to leave it in, just in case you use the variable for more than just a double checked lock. It has no easy way of proving that you never use one of the more complicated operations on your atomic.
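A hand-written double checked lock with C++11 atomics might look like the sketch below. Config, g_config, g_config_mutex and get_config are hypothetical names chosen for illustration. The point to notice is that the acquire load on the first line of get_config runs on every call, including long after initialization has finished.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Config { int value = 42; };   // hypothetical payload

std::atomic<Config*> g_config{nullptr};
std::mutex g_config_mutex;

Config& get_config() {
    // Fast path: an acquire load, paid on every call.
    Config* p = g_config.load(std::memory_order_acquire);
    if (p == nullptr) {
        // Slow path: take the lock and check again, because another
        // thread may have finished initializing in the meantime.
        std::lock_guard<std::mutex> lock(g_config_mutex);
        p = g_config.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new Config;   // intentionally never deleted in this sketch
            // Release store: publishes the fully constructed object.
            g_config.store(p, std::memory_order_release);
        }
    }
    assert(p != nullptr);
    return *p;
}
```

Because g_config is an ordinary std::atomic, nothing stops other code from calling exchange or compare_exchange_strong on it, which is why the compiler has to treat the load conservatively.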

On the other hand, static is highly specialized, and part of the language. It was designed, from the start, to be very easy to provably initialize. Accordingly, the compiler can take shortcuts that are not available to the more generic version. To see what the compiler actually emits for a static, take a simple function:

void foo() {
    static X x;
}

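This is rewritten inside GCC to something like the following pseudocode:

void foo() {
    static X x;                       // storage only; not yet constructed
    static guard x_is_initialized;    // zero-initialized guard object
    // Fast path: plain check of the guard's first byte.
    if ( first_byte(x_is_initialized) == 0 ) {
        // Non-zero result means this thread must run the initializer.
        if ( __cxa_guard_acquire(&x_is_initialized) ) {
            new (&x) X();             // construct x in place
            // Marks x as initialized by setting the first byte to non-zero.
            __cxa_guard_release(&x_is_initialized);
        }
    }
}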

This looks a lot like a double checked lock. However, the compiler gets to cheat a little here. It knows the user can never use a cxa_guard directly. It knows that the guard is only used in the special circumstances where the compiler chooses to use it. With that extra information, it can save some time. The CXA guard specifications, scattered as they are, all share a common rule: __cxa_guard_acquire will never modify the first byte of the guard, and __cxa_guard_release will set it to non-zero.

This means each guard has to be monotonic, and the specification says exactly which operations change it. Accordingly, the compiler can take advantage of existing race protections within the host platform. On x86, for instance, the load-load and store-store ordering (LL/SS) guaranteed by such strongly ordered CPUs turns out to be enough for this acquire/release pattern, so the fast path can do a raw read of that first byte rather than an acquire read. This is only possible because GCC isn't using the C++ atomic API to do its double checked locking; it is using a platform-specific approach.

GCC cannot optimize out the atomic in the general case. On architectures which are designed to be less synchronized (such as those designed for 1024+ cores), GCC doesn't get to rely on the architecture to provide LL/SS ordering for it. Thus GCC is forced to actually emit the atomic. However, on common platforms such as x86 and x64, the static can be faster.

call_once can have the efficiency of GCC's statics, because it similarly limits the number of operations which can be done to a once_flag to a fraction of the functions that can be applied to an atomic. The tradeoff is that statics are far more convenient to use, when they are applicable, but call_once works in many cases where statics are insufficient (such as a once_flag owned by a dynamically generated object).
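As an illustration of the "once_flag owned by a dynamically generated object" case, here is a minimal sketch. Document, word_count, computations and document_demo are hypothetical names. A function-local static could not express this, because a static would be shared by every instance, whereas here each object carries its own flag.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical class: each instance lazily computes its own cached value.
class Document {
public:
    explicit Document(std::string text) : text_(std::move(text)) {}

    // Thread-safe lazy computation, once per Document instance.
    std::size_t word_count() const {
        std::call_once(once_, [this] {
            count_ = count_words();
            ++computations_;
        });
        return count_;
    }

    // How many times the expensive computation ran (expected: 0 or 1).
    int computations() const { return computations_; }

private:
    // Counts runs of non-space characters separated by ' '.
    std::size_t count_words() const {
        std::size_t n = 0;
        bool in_word = false;
        for (char c : text_) {
            if (c == ' ') {
                in_word = false;
            } else if (!in_word) {
                in_word = true;
                ++n;
            }
        }
        return n;
    }

    std::string text_;
    mutable std::once_flag once_;      // one flag per object
    mutable std::size_t count_ = 0;
    mutable int computations_ = 0;
};

// Usage sketch: two heap-allocated documents, initialized independently.
void document_demo() {
    auto a = std::make_unique<Document>("one two three");
    auto b = std::make_unique<Document>("hello");
    assert(a->word_count() == 3);
    assert(a->word_count() == 3);      // second call reuses the cached value
    assert(a->computations() == 1);
    assert(b->word_count() == 1);      // b has its own flag and cache
}
```

One design consequence worth noting: std::once_flag is neither copyable nor movable, so a class that owns one is not copyable or movable by default either.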

There is a slight difference in performance between static and call_once on these more weakly ordered platforms. Many of them, while not offering LL/SS ordering guarantees, will at least offer non-tearing reads of an integer. An implementation can use this, together with a thread-specific pointer, to do per-thread epoch counting and so avoid atomics. This is sufficient for static or call_once, but it depends on the counter not rolling over. If you do not have a tearing-free 64-bit integer, call_once has to worry about rollover. The implementation may or may not worry about this. If it ignores the issue, it can be as fast as statics. If it pays attention to the issue, it has to be as slow as atomics. With statics, the compiler knows at compile time how many static variables/blocks there are, so it can prove there is no rollover at compile time (or at least be darn confident!).