I ran an experiment in which I built a simple producer/consumer program. The producer generates some data in one thread and the consumer picks it up in another thread. The messaging latency I achieved is approximately 100 nanoseconds. Can anybody tell me if this is reasonable, or are there significantly faster implementations out there?
I'm not using locks ... just simple memory counters. My experiment is described here:
http://tradexoft.wordpress.com/2012/10/22/how-to-move-data-between-threads-in-100-nanoseconds/
Basically the consumer waits on a counter to be incremented and then it calls the handler function. So not much code really. Still I was surprised it took 100ns.
The consumer looks like this:
    void operator()()
    {
        while (true)
        {
            while (w_cnt==r_cnt) {}
            auto rc=process_data(data);
            r_cnt++;
            if (!rc)
                break;
        }
    }
The producer simply increments w_cnt when it has data available.
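For reference, here is a minimal self-contained sketch of the counter handoff described above. It assumes the counters are `std::atomic` (with plain integers the spin loop would be a data race and the compiler could hoist the load out of the loop). The names `Mailbox`, `produce`, `consume` and `run_demo` are hypothetical and not taken from the linked post.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical single-slot mailbox: one producer thread, one consumer thread.
struct Mailbox {
    std::atomic<std::uint64_t> w_cnt{0};  // incremented by the producer only
    std::atomic<std::uint64_t> r_cnt{0};  // incremented by the consumer only
    int data = 0;                         // payload guarded by the two counters
};

// Producer: wait until the previous value was consumed, then publish a new one.
void produce(Mailbox& m, int value) {
    while (m.w_cnt.load(std::memory_order_relaxed) !=
           m.r_cnt.load(std::memory_order_acquire)) {}
    m.data = value;
    m.w_cnt.fetch_add(1, std::memory_order_release);  // makes `data` visible
}

// Consumer: spin until w_cnt moves ahead of r_cnt, then call the handler.
template <class Handler>
void consume(Mailbox& m, Handler process_data) {
    while (true) {
        while (m.w_cnt.load(std::memory_order_acquire) ==
               m.r_cnt.load(std::memory_order_relaxed)) {}
        bool rc = process_data(m.data);
        m.r_cnt.fetch_add(1, std::memory_order_release);  // slot is free again
        if (!rc) break;
    }
}

// Sends 1..n followed by a negative sentinel; returns the sum the consumer saw.
long run_demo(int n) {
    Mailbox m;
    long sum = 0;
    std::thread consumer([&] {
        consume(m, [&](int v) {
            if (v < 0) return false;  // sentinel: stop consuming
            sum += v;
            return true;
        });
    });
    for (int i = 1; i <= n; ++i) produce(m, i);
    produce(m, -1);
    consumer.join();
    assert(m.w_cnt.load() == m.r_cnt.load());  // everything was consumed
    return sum;
}
```

The release/acquire pairs are what make the payload visible to the other thread; the spinning itself is unchanged from the loop in the question.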
Is there a faster way?
I imagine your latency is a product of how the operating system schedules context-switching, rather than the spin lock itself, and I doubt you can do much about it.
You can, however, move more data at once by using a ring buffer. If one thread writes and one thread reads, you can implement a ring buffer without locks. Essentially it would be the same spin-lock approach (waiting until tailidx != headidx), but the producer could pump more than a single value into the buffer before it is switched out to the consumer. That ought to improve your overall throughput, and so your average latency per value (but not your single-value latency).
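A lock-free single-producer/single-consumer ring buffer along these lines could be sketched as follows. This is only an illustration under stated assumptions: exactly one thread calls `try_push` and exactly one calls `try_pop`, one slot is left unused to tell "full" from "empty", and the names `SpscRing`, `try_push`, `try_pop` and `ring_roundtrip_demo` are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical SPSC ring buffer holding up to N-1 elements.
template <class T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "need at least two slots");
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // next slot to read (consumer side)
    std::atomic<std::size_t> tail_{0};  // next slot to write (producer side)

public:
    // Producer side: returns false if the buffer is full.
    bool try_push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }

    // Consumer side: returns false if the buffer is empty (tail == head).
    bool try_pop(T& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }
};

// Single-threaded walk-through: fill the ring, then drain it and sum the values.
int ring_roundtrip_demo() {
    SpscRing<int, 4> ring;  // capacity is 3
    for (int i = 1; i <= 3; ++i) {
        bool pushed = ring.try_push(i);
        assert(pushed);
    }
    assert(!ring.try_push(4));  // full
    int sum = 0, v = 0;
    while (ring.try_pop(v)) sum += v;
    return sum;
}
```

In a real producer/consumer pair each side would spin (or back off) when its call returns false, which is the same wait as tailidx != headidx above, but the producer can queue several values before the consumer catches up.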
If your threads are executed on different cores, then the fastest way to "send a message" from one thread to another is a write barrier (sfence).
When you write to some memory location, you actually write to the processor's write buffer, not to the main-memory location. The write buffer is periodically flushed to main memory by the processor. A write instruction can also be delayed when instruction reordering occurs. When the actual write to main memory occurs, the cache coherency protocol comes into play and "informs" the other processor about the memory location update. After that, the other processor invalidates its cache line, and the other thread will be able to see your changes.
A store barrier forces the processor to flush the write buffer and prohibits instruction reordering, so your program will be able to send more messages per second.
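As a rough illustration of where such a barrier would go, here is a sketch in portable C++ that uses `std::atomic_thread_fence` rather than the x86-specific `sfence` instruction; that substitution is my assumption, not something stated above. The names `Channel`, `send`, `receive` and `fence_demo` are hypothetical. Note that the fence guarantees ordering and visibility of the payload; whether it actually lowers latency on a given CPU is something you would have to measure.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical one-shot channel: a plain payload plus an atomic flag.
struct Channel {
    int payload = 0;
    std::atomic<bool> ready{false};
};

// Sender: write the payload, issue a release fence, then raise the flag.
void send(Channel& ch, int value) {
    ch.payload = value;
    std::atomic_thread_fence(std::memory_order_release);  // store barrier
    ch.ready.store(true, std::memory_order_relaxed);
}

// Receiver: spin on the flag, issue an acquire fence, then read the payload.
int receive(Channel& ch) {
    while (!ch.ready.load(std::memory_order_relaxed)) {}
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    return ch.payload;
}

// Sends one value across threads and returns what the receiver observed.
int fence_demo(int value) {
    Channel ch;
    int got = 0;
    std::thread receiver([&] { got = receive(ch); });
    send(ch, value);
    receiver.join();
    assert(ch.ready.load());
    return got;
}
```

The release fence keeps the payload write from being reordered after the flag write, and the acquire fence on the other side keeps the payload read from being moved before the flag check.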