Summary: I had expected that std::atomic<int*>::load
with std::memory_order_relaxed
would be close to the performance of just loading a pointer directly, at least when the loaded value rarely changes. I saw far worse performance for the atomic load than a normal load on Visual Studio C++ 2012, so I decided to investigate. It turns out that the atomic load is implemented as a compare-and-swap loop, which I suspect is not the fastest possible implementation.
Question: Is there some reason that std::atomic<int*>::load
needs to do a compare-and-swap loop?
Background: I believe that MSVC++ 2012 is doing a compare-and-swap loop on atomic load of a pointer based on this test program:
#include <atomic>
#include <iostream>
template<class T>
__declspec(noinline) T loadRelaxed(const std::atomic<T>& t) {
return t.load(std::memory_order_relaxed);
}
int main() {
int i = 42;
char c = 42;
std::atomic<int*> ptr(&i);
std::atomic<int> integer;
std::atomic<char> character;
std::cout
<< *loadRelaxed(ptr) << ' '
<< loadRelaxed(integer) << ' '
<< loadRelaxed(character) << std::endl;
return 0;
}
I'm using a __declspec(noinline)
function in order to isolate the assembly instructions related to the atomic load. I made a new MSVC++ 2012 project, added an x64 platform, selected the release configuration, ran the program in the debugger and looked at the disassembly. Turns out that both std::atomic<char>
and std::atomic<int>
parameters end up giving the same call to loadRelaxed<int>
- this must be something the optimizer did. Here is the disassembly of the two loadRelaxed instantiations that get called:
loadRelaxed<int * __ptr64>
000000013F4B1790 prefetchw [rcx]
000000013F4B1793 mov rax,qword ptr [rcx]
000000013F4B1796 mov rdx,rax
000000013F4B1799 lock cmpxchg qword ptr [rcx],rdx
000000013F4B179E jne loadRelaxed<int * __ptr64>+6h (013F4B1796h)
loadRelaxed<int>
000000013F3F1940 prefetchw [rcx]
000000013F3F1943 mov eax,dword ptr [rcx]
000000013F3F1945 mov edx,eax
000000013F3F1947 lock cmpxchg dword ptr [rcx],edx
000000013F3F194B jne loadRelaxed<int>+5h (013F3F1945h)
The instruction lock cmpxchg
is atomic compare-and-swap and we see here that the code for atomically loading a char
, an int
or an int*
is a compare-and-swap loop. I also built this code for 32-bit x86 and that implementation is still based on lock cmpxchg
.
Question: Is there some reason that std::atomic<int*>::load
needs to do a compare-and-swap loop?