In his Blog Herb Sutter writes
[...] because incrementing the smart pointer reference count
can usually be optimized to be the same as an ordinary increment
in an optimized shared_ptr
implementation — just an ordinary increment instruction,
and no fences, in the generated code.
However, the decrement must be an atomic decrement or equivalent,
which generates special processor memory instructions that are more expensive in
themselves, and that on top of that induce memory
fence restrictions on optimizing the surrounding code.
The text is about the implementation of shared_ptr
and I am not sure if his remark applies only on this or is generally the case. From his formulation I gather it is generally.
But when I think about it I can only think of "more expensive decrement" when a if(counter==0)
immediately follows -- which probably is the case with shared_ptr
.
Therefore I wonder if the atomic operation ++counter
is (usually) always faster than --counter
, or just because it is used if(--counter==0)...
with shared_ptr
?
I believe it's referring to the fact that the increment can be "hidden", where the "decrement and check" has to be done as one operation.
I'm not aware of any architecture where --counter
(or counter--
, asssuming we're talking about simple datatypes like int, char, etc) is slower than ++counter
or counter++
.
He discusses this in more detail somewhere, I think in his atomic<> weapons presentation. Basically it's all about where the memory fences are needed in the shared_ptr use case, not any intrinsic property of atomic increments vs decrements.
The reason is that you don't really care about the exact ordering of increments with a ref counted smart pointer as long as you don't miss any but with decrements it is critical that you have memory barriers in place so that on your final decrement which triggers the delete you don't have any possibility of previous memory accesses from another thread to the object owned by the smart pointer being reordered to after the memory is freed.
The issue Sutter is talking about is the fact that the reference count increment doesn't require any follow-up action for correctness. You're taking a non-zero reference count to another non-zero count, so no further action is required. However, the decrement requires follow-up action for correctness. The decrement takes a non-zero reference count to either a non-zero or zero reference count, and if your decrement of the reference count goes to zero, you need to perform an action --- specifically, deallocate the referenced object. This decrement-and-act dynamic requires greater consistency, both at the fence level (so the deallocation doesn't get reordered with some other read/write on another core that the CPU's memory/cache-management logic re-ordered) and at the compiler level (so the compiler doesn't reorder operations around the decrement that might cause reads/writes to be re-ordered around the potential deallocation).
So, for Sutter's described scenario, the difference in cost between increment and decrement isn't in the fundamental operations themselves, it's in the consistency constraints imposed on the actual use of the decrement (specifically, acting on the decrement itself) that don't apply to the increment.