The "wait-free multi-producer queue" usage example from Boost.Atomic:
#include <boost/atomic.hpp>

template<typename T>
class waitfree_queue {
public:
    struct node {
        T data;
        node * next;
    };
    void push(const T &data)
    {
        node * n = new node;
        n->data = data;
        node * stale_head = head_.load(boost::memory_order_relaxed);
        do {
            n->next = stale_head;
            // Retry until no other producer modified head_ in between.
        } while (!head_.compare_exchange_weak(stale_head, n, boost::memory_order_release));
    }
    node * pop_all(void)
    {
        // Reverse the snapshot so the caller sees FIFO order.
        node * last = pop_all_reverse(), * first = 0;
        while (last) {
            node * tmp = last;
            last = last->next;
            tmp->next = first;
            first = tmp;
        }
        return first;
    }
    waitfree_queue() : head_(0) {}
    // alternative interface if ordering is of no importance
    node * pop_all_reverse(void)
    {
        // Atomically take the whole list in one exchange.
        return head_.exchange(0, boost::memory_order_consume);
    }
private:
    boost::atomic<node *> head_;
};
http://www.boost.org/doc/libs/1_63_0_b1/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.mp_queue
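For context, here is how I understand the queue is meant to be used (my own sketch, not from the Boost docs; the thread counts and the worker function are mine): several producers call push concurrently, and a single consumer drains everything at once.

#include <thread>
#include <vector>
#include <cstdio>

waitfree_queue<int> q;

void worker(int id)
{
    for (int i = 0; i < 1000; ++i)
        q.push(id * 1000 + i);
}

int main()
{
    std::vector<std::thread> producers;
    for (int id = 0; id < 4; ++id)
        producers.emplace_back(worker, id);
    for (auto &t : producers)
        t.join();

    // The consumer takes the whole list in one atomic exchange,
    // then walks it with plain non-atomic code.
    for (auto *n = q.pop_all(); n != 0; ) {
        std::printf("%d\n", n->data);
        auto *dead = n;
        n = n->next;
        delete dead;   // push() news each node; the consumer frees them
    }
}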
But I found that the code in push is lock-free rather than wait-free. Suppose multiple producers are calling push: at least one producer makes progress, while the others just run the loop again until they succeed. There exists a schedule that starves a particular thread for an unbounded time.
The definition of wait-free says that any given thread, when given a time-slice, will make some progress and eventually complete, while lock-free only guarantees that at least one thread makes progress. So the code above seems to satisfy only the definition of lock-free.
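For contrast, here is an operation that does satisfy the wait-free definition as I understand it (my own example, assuming the hardware implements the RMW as a single instruction rather than a retry loop, as x86's lock xadd does):

#include <boost/atomic.hpp>

boost::atomic<int> counter(0);

// Every thread completes this in a bounded number of its own steps,
// no matter what other threads do: there is no retry loop to re-run.
void hit()
{
    counter.fetch_add(1, boost::memory_order_relaxed);
}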
Are there mistakes in my understanding?
No, there's no mistake: your analysis looks correct for the abstract C++ model.
push is lock-free but not wait-free. The CAS-retry loop is on head_, which other threads can keep modifying while we're trying, so any given thread can retry an unbounded number of times; that rules out wait-free.
But a thread's CAS only fails because some other thread's modification of head_ succeeded, so at least one thread always makes progress, and there's no point where a thread can go to sleep and block all the others. That satisfies lock-free.
pop_all_reverse (and thus pop_all) is wait-free: it only does a single unconditional atomic exchange, which (assuming some hardware fairness...) should complete in a bounded number of steps.
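(Side note, my own observation: as far as I know, current compilers promote memory_order_consume to acquire anyway, so spelling it as acquire is equivalent in practice and arguably clearer:)

node * pop_all_reverse(void)
{
    // Same effect on today's compilers, which strengthen consume to acquire.
    return head_.exchange(0, boost::memory_order_acquire);
}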
If implemented on real hardware as an LL/SC retry loop, exchange also becomes only technically lock-free, not guaranteed wait-free. But I think hardware can be designed to promote a successful SC by a core that got a chance to do an LL, avoiding the situation where a core briefly gets the cache line in Exclusive state but loses ownership before its atomic operation completes. I don't know whether that's typical. In the worst case, I think LL/SC contention could even create livelock, where no thread makes progress.
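To make that concrete, an LL/SC lowering of exchange behaves like this CAS-retry emulation, expressed at the C++ level (exchange_via_cas is a name I made up for the sketch):

// What exchange amounts to on an LL/SC machine: a retry loop that can
// fail whenever another core touches the line between the LL and the SC.
template<typename T>
T exchange_via_cas(boost::atomic<T> &a, T desired)
{
    T expected = a.load(boost::memory_order_relaxed);        // like LL
    while (!a.compare_exchange_weak(expected, desired,       // like SC
                                    boost::memory_order_release,
                                    boost::memory_order_relaxed)) {
        // expected was reloaded on failure; another thread won, retry.
    }
    return expected;
}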
More normally, exchange always succeeds the first time it executes, but it has to wait for ownership of the cache line before it can run.
The same will often be true of the CAS; I'd expect actual retries to be rare even in high-contention cases. The CAS on the first trip through the loop will already be decoded and waiting to execute, needing only the first load's result as input. If other threads are writing the cache line, that load can't complete until the line arrives, and the line may arrive in Exclusive state if the CPU notices the waiting CAS and sends an RFO (Read For Ownership) after the regular read request.
Or maybe some CPUs aren't that smart; if the line arrives in Shared state, the CAS has to wait for a response to an RFO, and that gives a big window for another core to grab the line and modify it between the first load and the first CAS.
But after the first CAS, the load result comes from that CAS itself, so the thread is definitely reading data from a cache line its core owns exclusively, and the next CAS can run right away and succeed.
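If you want to test that expectation, one way (my own instrumentation, not part of the Boost example; retries_ is a hypothetical member) is to count failed CAS attempts in push:

boost::atomic<unsigned long> retries_;   // hypothetical: counts failed CAS attempts

void push(const T &data)
{
    node * n = new node;
    n->data = data;
    node * stale_head = head_.load(boost::memory_order_relaxed);
    n->next = stale_head;
    while (!head_.compare_exchange_weak(stale_head, n,
                                        boost::memory_order_release)) {
        n->next = stale_head;  // the failed CAS reloaded stale_head for us
        retries_.fetch_add(1, boost::memory_order_relaxed);
        // note: weak CAS can also fail spuriously, so this slightly
        // overcounts real contention on LL/SC machines
    }
}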
So in practice there's probably not a huge difference between exchange and a CAS retry loop, even on x86 or other ISAs where xchg or swp has true hardware support and runs without a retry loop. But the result is maybe better described as lock-free than wait-free, because only one thread at a time can make progress in modifying head_, even with exchange; the possible wait time scales with the number of other threads (and with the fairness of the atomic operation).
So the definitions start to feel a little fuzzy when you look at code compiled for real hardware. I would still not describe this queue as wait-free, though; it's certainly lock-free.