The situation I'll describe is occurring on an iPad 4 (ARMv7s), using the POSIX libs to lock/unlock a mutex. I've seen similar things on other ARMv7 devices, though (see below), so I suppose any solution will require a more general look at the behaviour of mutexes and memory fences on ARMv7.
Pseudo code for the scenario:
Thread 1 – Producing Data:
void ProduceFunction() {
    MutexLock();
    int TempProducerIndex = mSharedProducerIndex; // Take a copy of the int member variable for Producers Index
    mSharedArray[TempProducerIndex++] = NewData; // Copy new Data into array at Temp Index
    mSharedProducerIndex = TempProducerIndex; // Signal consumer data is ready by assigning new Producer Index to shared variable
    MutexUnlock();
}
Thread 2 – Consuming Data:
void ConsumingFunction() {
    while (mConsumerIndex != mSharedProducerIndex) {
        doWorkOnData(mSharedArray[mConsumerIndex++]);
    }
}
Previously (when the problem cropped up on iPad 2), I believed that mSharedProducerIndex = TempProducerIndex was not being performed atomically, and hence changed to use an AtomicCompareAndSwap to assign mSharedProducerIndex. This has worked up until this point, but it turns out I was wrong and the bug has come back. I guess the 'fix' just changed some timing.
I have now come to the conclusion that the actual problem is an out of order execution of the writes within the mutex lock, i.e. if either the compiler or the hardware decided to reorder:
mSharedArray[TempProducerIndex++] = NewData; // Copy new Data into array at Temp Index
mSharedProducerIndex = TempProducerIndex; // Signal consumer data is ready by assigning new Producer Index to shared variable
... to:
mSharedProducerIndex = TempProducerIndex; // Signal consumer data is ready by assigning new Producer Index to shared variable
mSharedArray[TempProducerIndex++] = NewData; // Copy new Data into array at Temp Index
... and then the consumer interleaved with the producer, the data would not yet have been written when the consumer tried to read it.
After some reading on memory barriers, I therefore thought I’d try moving the signal to the consumer outside the mutex_unlock, believing that the unlock would produce a memory barrier/fence which would ensure mSharedArray had been written to:
mSharedArray[TempProducerIndex++] = NewData; // Copy new Data into array at Temp Index
MutexUnlock();
mSharedProducerIndex = TempProducerIndex; // Signal consumer data is ready by assigning new Producer Index to shared variable
This, however, still fails, and leads me to question whether a mutex_unlock will definitely act as a write fence or not.
I've also read an article from HP which suggested that compilers could move code into (but not out of) critical sections. So even after the above change, the write of mSharedProducerIndex could be moved before the barrier. Is there any mileage in this theory?
By adding an explicit fence the problem goes away:
mSharedArray[TempProducerIndex++] = NewData; // Copy new Data into array at Temp Index
OSMemoryBarrier();
mSharedProducerIndex = TempProducerIndex; // Signal consumer data is ready by assigning new Producer Index to shared variable
I therefore think I understand the problem, and that a fence is required, but any insight into the behaviour of the unlock and why it doesn’t appear to be performing a barrier would be really useful.
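For comparison, here is a minimal sketch of the same explicit-fence idea written with portable C++11 atomics instead of OSMemoryBarrier(). It is an illustration under assumptions, not the original code: the names produce, consumeAvailable, kCapacity and producerMutex are hypothetical, the array does not wrap around, and it pairs the producer's release fence with an acquire fence on the consumer side.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical stand-ins for the members in the question; the size is illustrative.
constexpr std::size_t kCapacity = 1024;
int sharedArray[kCapacity];
std::atomic<std::size_t> sharedProducerIndex{0};
std::mutex producerMutex;        // only serialises producers against each other
std::size_t consumerIndex = 0;   // touched by the single consumer thread only

// Called by any producer thread. Returns false once the array is full.
bool produce(int newData) {
    std::lock_guard<std::mutex> lock(producerMutex);
    std::size_t idx = sharedProducerIndex.load(std::memory_order_relaxed);
    if (idx == kCapacity) return false;  // no wrap-around in this sketch
    sharedArray[idx] = newData;
    // Release fence: the array write above cannot be reordered
    // past the index store below.
    std::atomic_thread_fence(std::memory_order_release);
    sharedProducerIndex.store(idx + 1, std::memory_order_relaxed);
    return true;
}

// Called by the single consumer thread. Sums whatever has been published.
long consumeAvailable() {
    std::size_t published = sharedProducerIndex.load(std::memory_order_relaxed);
    assert(published <= kCapacity);
    // Acquire fence: the array reads below cannot be reordered
    // before the index load above.
    std::atomic_thread_fence(std::memory_order_acquire);
    long sum = 0;
    while (consumerIndex != published) {
        sum += sharedArray[consumerIndex++];
    }
    return sum;
}
```

In use, producer threads would call produce() while one consumer thread polls consumeAvailable(). The index is a std::atomic here so that the unlocked read in the consumer is not a data race in the C++ memory model.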
EDIT:
Regarding the lack of a mutex in the consumer thread: I'm relying on the write of the int mSharedProducerIndex being a single instruction and therefore hoping the consumer would read either the new or old value. Either is a valid state, and providing that mSharedArray is written in sequence (i.e. prior to writing mSharedProducerIndex) this would be OK, but from what has been said so far, I can’t rely on this.
By the same logic it appears that the current barrier solution is also flawed, as the mSharedProducerIndex write could be moved inside the barrier and could therefore potentially be incorrectly re-ordered.
Is it recommended to add a mutex to the consumer, just to act as a read barrier, or is there a pragma or instruction for disabling out-of-order execution on the producer, like EIEIO on PPC?
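One way to picture the "mutex in the consumer" option is the sketch below, where the consumer takes the same lock only long enough to snapshot the producer index. This is a hedged illustration, not the original code: LockedQueue, produce and consume are hypothetical names, the storage is a fixed-size std::vector that never reallocates, and there is no wrap-around.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical wrapper around the shared state from the question.
struct LockedQueue {
    std::mutex mutex;
    std::vector<int> sharedArray;          // fixed size, never resized
    std::size_t sharedProducerIndex = 0;   // guarded by mutex
    std::size_t consumerIndex = 0;         // consumer thread only

    explicit LockedQueue(std::size_t capacity) : sharedArray(capacity) {}

    // Any producer thread. Returns false once the array is full.
    bool produce(int newData) {
        std::lock_guard<std::mutex> lock(mutex);
        if (sharedProducerIndex == sharedArray.size()) return false;
        sharedArray[sharedProducerIndex] = newData;
        ++sharedProducerIndex;
        return true;
    }

    // Single consumer thread. Returns how many items were handled.
    template <typename Fn>
    std::size_t consume(Fn doWorkOnData) {
        std::size_t published;
        {
            // Locking the same mutex orders this read after every
            // array write made before the producer's unlock.
            std::lock_guard<std::mutex> lock(mutex);
            published = sharedProducerIndex;
        }
        assert(published <= sharedArray.size());
        std::size_t handled = 0;
        // Slots below `published` are never written again, so they
        // can be read without holding the lock.
        while (consumerIndex != published) {
            doWorkOnData(sharedArray[consumerIndex++]);
            ++handled;
        }
        return handled;
    }
};
```

The lock is held only for the index snapshot, so the work on the data itself does not block producers.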
Your producers are synchronized, but you don't do any synchronization on the consuming side (you need to synchronize memory with barriers there as well). So even if you have perfect memory barriers for the producers, those memory barriers won't help the consumer.
In your code, you can be hit by compiler reordering, hardware reordering, and even by a stale value of mSharedProducerIndex on the other core running Thread #2.
You should read Chapter 11: Memory Ordering from the Cortex™-A Series Programmer’s Guide, especially 11.2.1 Memory barrier use example.
I think your problem is that you are getting partial updates in the consumer thread. The problem is that what is inside the critical section in the producer is not atomic and it can be reordered.
By not atomic I mean that if your mSharedArray[TempProducerIndex++] = NewData; is not a single word store (a word store being the case where NewData has type int), it might be done in several steps, which can be seen by the other core as partial updates.
By reordering I mean that the mutex provides barriers on the way in and out but does not impose any ordering within the critical section. Since you don't have any special construct on the consumer side, you can see that mSharedProducerIndex is updated but still see partial updates to mSharedArray[mConsumerIndex]. A mutex only guarantees memory visibility after execution leaves the critical section.
I believe this also explains why it works when you add OSMemoryBarrier(); inside the critical section, because this way the CPU is forced to write the data into mSharedArray and then update mSharedProducerIndex, and when the other core/thread sees the new mSharedProducerIndex we know that mSharedArray has been copied fully because of the barrier.
I think your implementation with OSMemoryBarrier(); is correct, assuming you have many producers and one consumer. I disagree with any comments suggesting putting a memory barrier in the consumer, since I believe that won't fix partial updates or reordering happening in the critical section inside the producer.
As an answer to the question in your title: in general, afaik, mutexes have a read barrier before they enter and a write barrier after they leave.
The "theory" is correct: writes can be moved from after a write fence to before it.
The fundamental problem with your code is that there is no synchronization at all in thread 2. You read mSharedProducerIndex without a read barrier, so who knows what value you'll get. Nothing that you do in thread 1 will solve that.
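To make the point about thread 2 concrete, here is a minimal sketch in C++11 where the synchronization lives on the index itself: the producer publishes with a release store and the consumer reads with an acquire load. It is an assumption-laden illustration rather than the original code: PublishedLog and its members are hypothetical names, it supports one producer thread and one consumer thread, and the buffer does not wrap around.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical single-producer / single-consumer log of N ints.
template <std::size_t N>
class PublishedLog {
public:
    // Producer thread only. Returns false once the buffer is full.
    bool produce(int newData) {
        std::size_t idx = producerIndex_.load(std::memory_order_relaxed);
        if (idx == N) return false;
        data_[idx] = newData;
        // Release store: the data write above becomes visible to any
        // thread that observes the new index with an acquire load.
        producerIndex_.store(idx + 1, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns how many items were handled.
    template <typename Fn>
    std::size_t consume(Fn doWorkOnData) {
        // Acquire load: pairs with the release store in produce().
        std::size_t published = producerIndex_.load(std::memory_order_acquire);
        assert(published <= N);
        std::size_t handled = 0;
        while (consumerIndex_ != published) {
            doWorkOnData(data_[consumerIndex_++]);
            ++handled;
        }
        return handled;
    }

private:
    std::array<int, N> data_{};
    std::atomic<std::size_t> producerIndex_{0};
    std::size_t consumerIndex_ = 0;
};
```

With several producers, a mutex around produce() would still be needed to keep them from claiming the same slot; the release/acquire pair only orders the producer's writes against the consumer's reads.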