In general, most widely known implementations of reference-counting smart ptr classes in C++, including the standard std::shared_ptr
, use atomic reference counting, but do not provide atomic access to the same smart ptr instance. In other words, multiple threads may safely operate on separate shared_ptr
instances which point to the same shared object, but multiple threads cannot safely read/write instances of the same shared_ptr
instance without providing some kind of synchronization such as a mutex or whatever.
An atomic version of a shared_ptr
called "atomic_shared_ptr
" has been proposed, and preliminary implementations already exist. Presumably, atomic_shared_ptr
could easily be implemented with a spin lock or mutex, but a lock-free implementation is also possible.
After studying some of these implementations, one thing is obvious: implementing a lock-free std::shared_ptr
is very difficult, and seems to require so many compare_and_exchange
operations as to make me question whether a simple spin lock would achieve better performance.
The main reason that it's so difficult to implement a lock-free reference-counted pointer is because of the race that always exists between reading the shared control block (or the shared object itself, if we're talking about an intrusive shared pointer), and modifying the reference count.
In other words, you can't even safely read the reference count because you never know when some other thread has deallocated the memory where the reference count lives.
So in general, various complex strategies are employed to create lock-free versions. The implementation here looks like it uses a double-reference count strategy, where there are "local" references which count the number of threads concurrently accessing the shared_ptr
instance, and then "shared" or "global" references which count the number of shared_ptr instances pointing to the shared object.
Given all this complexity, I was really surprised to find a Dr. Dobbs article, from 2004 no less (way before C++11 atomics) that seems to nonchalantly solve this entire problem:
http://www.drdobbs.com/atomic-reference-counting-pointers/184401888
It looks like the author is claiming to somehow be able to:
"... [read] the pointer to the counter, increments the counter, and returns the pointer—all in such a manner that no other threads can cause an incorrect result"
But I don't really understand the way he actually implements this. He's using (non-portable) PowerPC instructions (the LL/SC primitives lwarx
and stwcx
) to pull this off.
The relevant code that does this is what he calls an "aIandF
" (atomic increment and fetch), which he defines as:
addr aIandF(addr r1){
addr tmp;int c;
do{
do{
tmp = *r1;
if(!tmp)break;
c = lwarx(tmp);
}while(tmp != *r1);
}while(tmp && !stwcx(tmp,c+1));
return tmp;
};
Apparently, addr
is a pointer type pointing to the shared object that owns the reference count variable.
My question(s) is:, is this only possible to do with an architecture that supports LL/SC operations? It seems it would be impossible to do with cmpxchg
. And secondly, how exactly does this work? I've read over this code a few times now, and I can't really understand what's going on. I understand what LL/SC primitives do, I just can't make any sense of the code.
The best I can understand is that addr r1
is the address of the pointer to the shared object, and also the address of the pointer to the reference count (which I guess means the reference count variable needs to be the first member of the struct
that defines the shared object). He then dereferences addr
(getting the actual address of the shared object). Then, he linked loads the value stored at the address in tmp
, and stores the result in c
. This is the counter value. He then conditionally stores that value incremented (which will fail if tmp
has changed) back into tmp
.
What I don't understand is how this works. The address of the shared object may never change and the LL/SC could succeed - but how does this help us if another thread has deallocated the shared object in the mean time?