The more I read the more confused I become... I would have thought it trivial to find a formally correct mpsc queue implemented in c++.
Every time I find another stab at it, further research seems to suggest there are issues such as ABA or other subtle race conditions.
Many talk of the need for garbage collection. This is something I want to avoid.
Is there an accepted correct open source implementation out there?
The most suitable implementation depends on the desired properties of a queue. Should it be unbounded or a bounded one is fine? Should it be linearizable, or less strict requirements would be fine? How strong FIFO guarantees you need? Are you willing to pay the cost of reverting the list by the consumer (there exists a very simple implementation where the consumer grabs the tail of a single-linked list, thus getting at once all items put by producers till the moment)? Should it guarantee that no thread is ever blocked, or tiny chances to get some thread blocked are ok? And etc.
Some useful links:
Is multiple-producer, single-consumer possible in a lockfree setting?
http://www.1024cores.net/home/lock-free-algorithms/queues
http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
https://groups.google.com/group/comp.programming.threads/browse_frm/thread/33f79c75146582f3
Hope that helps.
You may want to check disruptor; it's available in C++ here: http://lmax-exchange.github.io/disruptor/
You can also find explanation how it works here on stackoverflow Basically it's circular buffer with no locking, optimized for passing FIFO messages between threads in a fixed-size slots.
Here are two implementations which I found useful: Lock-free Multi-producer Multi-consumer Queue on Ring Buffer @ NatSys Lab. Blog and
Yet another implementation of a lock-free circular array queue @ CodeProject
NOTE: the code below is incorrect, I leave it only as an example how tricky these things can be.
If you don't like the complexity of google version, here is something similar from me - it's much simpler, but I leave it as an exercise to the reader to make it work (it's part of larger project, not portable at the moment). The whole idea is to maintain cirtular buffer for data and a small set of counters to identify slots for writing/written and reading/read. Since each counter is in its own cache line, and (normally) each is only atomically updated once in the live of a message, they can all be read without any synchronisation. There is one potential contention point between writing threads in
post_done
, it's required for FIFO guarantee. Counters (head_, wrtn_, rdng_, tail_) were selected to ensure correctness and FIFO, so dropping FIFO would also require change of counters (and that might be difficult to do without sacrifying correctness). It is possible to slightly improve performance for scenarios with one consumer, but I would not bother - you would have to undo it if other use cases with multiple readers are found.On my machine latency looks like following (percentile on left, mean within this percentile on right, unit is microsecond, measured by rdtsc):
These results are for single polling consumer, i.e. worker thread calling wheel.read() in tight loop and checking if not empty (scroll to bottom for example). Waiting consumers (much lower CPU utilization) would wait on event (one of
acquire...
functions), this adds about 1-2us to average latency due to context switch.Since there is verly little contention on read, consumers scale very well with number of worker threads, e.g. for 3 threads on my machine:
Patches will be welcome :)
Here is what polling consumer worker thread may look like:
Edited added consumer example
Below is the technique I used for my Cooperative Multi-tasking / Multi-threading library (MACE) http://bytemaster.github.com/mace/. It has the benefit of being lock-free except for when the queue is empty.
I'm guessing no such thing exists - and if it does, it either isn't portable or isn't open source.
Conceptually, you are trying to control two pointers simultaneously: the
tail
pointer and thetail->next
pointer. That can't generally be done with just lock-free primitives.