Lock-free Reference counting and C++ smart pointer

In general, most widely known implementations of reference-counting smart ptr classes in C++, including the standard std::shared_ptr, use atomic reference counting, but do not provide atomic access to the same smart ptr instance. In other words, multiple threads may safely operate on separate shared_ptr instances which point to the same shared object, but multiple threads cannot safely read/write instances of the same shared_ptr instance without providing some kind of synchronization such as a mutex or whatever.

An atomic version of a shared_ptr called "atomic_shared_ptr" has been proposed, and preliminary implementations already exist. Presumably, atomic_shared_ptr could easily be implemented with a spin lock or mutex, but a lock-free implementation is also possible.

After studying some of these implementations, one thing is obvious: implementing a lock-free std::shared_ptr is very difficult, and seems to require so many compare_and_exchange operations as to make me question whether a simple spin lock would achieve better performance.

The main reason that it's so difficult to implement a lock-free reference-counted pointer is because of the race that always exists between reading the shared control block (or the shared object itself, if we're talking about an intrusive shared pointer), and modifying the reference count.

In other words, you can't even safely read the reference count because you never know when some other thread has deallocated the memory where the reference count lives.

So in general, various complex strategies are employed to create lock-free versions. The implementation here looks like it uses a double-reference count strategy, where there are "local" references which count the number of threads concurrently accessing the shared_ptr instance, and then "shared" or "global" references which count the number of shared_ptr instances pointing to the shared object.

Given all this complexity, I was really surprised to find a Dr. Dobbs article, from 2004 no less (way before C++11 atomics) that seems to nonchalantly solve this entire problem:

http://www.drdobbs.com/atomic-reference-counting-pointers/184401888

It looks like the author is claiming to somehow be able to:

"... [read] the pointer to the counter, increments the counter, and returns the pointer—all in such a manner that no other threads can cause an incorrect result"

But I don't really understand the way he actually implements this. He's using (non-portable) PowerPC instructions (the LL/SC primitives lwarx and stwcx) to pull this off.

The relevant code that does this is what he calls an "aIandF" (atomic increment and fetch), which he defines as:

addr aIandF(addr r1){
  addr tmp;int c;
  do{
    do{
      tmp = *r1;
      if(!tmp)break;
      c = lwarx(tmp);
    }while(tmp != *r1);
  }while(tmp && !stwcx(tmp,c+1));
  return tmp;
};

Apparently, addr is a pointer type pointing to the shared object that owns the reference count variable.

My question(s) is:, is this only possible to do with an architecture that supports LL/SC operations? It seems it would be impossible to do with cmpxchg. And secondly, how exactly does this work? I've read over this code a few times now, and I can't really understand what's going on. I understand what LL/SC primitives do, I just can't make any sense of the code.

The best I can understand is that addr r1 is the address of the pointer to the shared object, and also the address of the pointer to the reference count (which I guess means the reference count variable needs to be the first member of the struct that defines the shared object). He then dereferences addr (getting the actual address of the shared object). Then, he linked loads the value stored at the address in tmp, and stores the result in c. This is the counter value. He then conditionally stores that value incremented (which will fail if tmp has changed) back into tmp.

What I don't understand is how this works. The address of the shared object may never change and the LL/SC could succeed - but how does this help us if another thread has deallocated the shared object in the mean time?

addr aIandF(addr r1) {
  addr tmp;
  int c;
  do {
    do {
      // r1 holds the address of the address
      // of the refcount
      tmp = *r1;       // grab the address of the refcount
      if (!tmp) break; // if it's null, bail

      // read current refcount
      // and acquire reservation
      c = lwarx(tmp);

      // now we hold the reservation,
      // check to see if another thread
      // has changed the shared block address
    } while (tmp != *r1); // if so, start over

    // if the store succeeds we know we held
    // the reservation throughout
  } while (tmp && !stwcx(tmp, c+1));
  return tmp;
};

Note that aIandF is used specifically when constructing a copy of an existing shared pointer, claiming a reference for the copy.

The Dr Dobbs article describes the operation for releasing a reference as first atomically swapping the address of the shared counter in the source shared pointer object with a null pointer local to the function; then atomically decrementing the counter; then testing to see if the result of the decrement was zero. This order of operations is important: you say, "The address of the shared object may never change and the LL/SC could succeed - but how does this help us if another thread has deallocated the shared object in the mean time?" - but this can never happen, since the object will never be deallocated without the swap happening first, giving us a means to observe the change of address.

aIandF tests for the counter address being null on entry.

It can spot the address becoming null if that occurs before the lwarx, because it explicitly tests for this once it has the reservation.

If the swap in the decrementing thread occurs after the lwarx we do not actually care: if the stwcx in aIandF succeeds, we know the decrementing thread will see the new reference count and not destruct the object, and we can proceed knowing we have claimed a reference to it; whereas if the other thread succeeds in decrementing the counter first, we will lose our reservation, the store will fail and we'll detect the destruction of the object on the next loop iteration.

This algorithm assumes a strongly consistent memory model (all threads always see effects of each other's reads and writes in program order) - this is not necessarily the case even on those modern architectures that do support ll/sc.

EDIT: thinking about it, the algorithm also apparently makes the assumption that it is always safe to read from a memory address that was once valid (e.g., no MMU/protection; or, the algorithm is broken):

if (!tmp) break;

// another thread could, at this point, do its swap, 
// decrement *and* destroy the object tmp points to
// before we get to do anything else

c = lwarx(tmp); 

// if that happened, we'll detect this fact and do nothing with c
// but ONLY if the lwarx doesn't trap 
// (due to the memory tmp points to 
// getting unmapped when the other thread frees the object)