I am investigating Intel CPU atomic features on my Haswell CPU (a 4-core / 8-thread 2.3-3.9 GHz i7-4790M), and am finding it really hard to construct, e.g., reliable mutex_lock() and mutex_unlock() operations as suggested by, for instance, the GCC manual:
6.53 x86-Specific Memory Model Extensions for Transactional Memory
The x86 architecture supports additional memory ordering flags to mark lock critical sections for hardware lock elision. These must be specified in addition to an existing memory model to atomic intrinsics.
'__ATOMIC_HLE_ACQUIRE'
Start lock elision on a lock variable. Memory model must be
'__ATOMIC_ACQUIRE' or stronger.
'__ATOMIC_HLE_RELEASE'
End lock elision on a lock variable. Memory model must be
'__ATOMIC_RELEASE' or stronger.
When a lock acquire fails it is required for good performance to abort the transaction quickly. This can be done with a '_mm_pause'
#include <immintrin.h> // For _mm_pause
int lockvar;
/* Acquire lock with lock elision */
while (__atomic_exchange_n(&lockvar, 1,
__ATOMIC_ACQUIRE|__ATOMIC_HLE_ACQUIRE))
_mm_pause(); /* Abort failed transaction */
...
/* Free lock with lock elision */
__atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE|__ATOMIC_HLE_RELEASE);
So, reading that and the Intel Software Developer's Manual Vol. 3 section 8.1, "Locked Atomic Operations", particularly section 8.1.4, "Effects of a LOCK Operation on Internal Processor Caches", led me to implement my test mutex_lock() and mutex_unlock() at first like:
...
static inline __attribute__((always_inline,const))
bool ia64_has_clflush(void)
{ register unsigned int
ebx=0;
asm volatile
( "MOV $7, %%eax\n\t"
"MOV $0, %%ecx\n\t"
"CPUID\n\t"
"MOV %%ebx, %0\n\t"
: "=r" (ebx) :
: "%eax", "%ecx", "%ebx"
);
return ((ebx & (1U<<23)) ? true : false);
}
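(For comparison, a minimal sketch of the same feature check without hand-written asm, assuming a GCC new enough to provide __get_cpuid_count in <cpuid.h>; the helper name is mine, not from the test program:)

#include <cpuid.h>      /* GCC wrapper around the CPUID instruction */
#include <stdbool.h>

/* Sketch: query CPUID leaf 7, sub-leaf 0, and test EBX bit 23 (CLFLUSHOPT). */
static bool has_clflushopt_cpuid(void)
{ unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
      return false;               /* CPUID leaf 7 not supported */
  return (ebx & (1U << 23)) != 0;
}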
#define _LD_SEQ_CST_ __ATOMIC_SEQ_CST
#define _ST_SEQ_CST_ __ATOMIC_SEQ_CST
#define _ACQ_SEQ_CST_ (__ATOMIC_SEQ_CST|__ATOMIC_HLE_ACQUIRE)
#define _REL_SEQ_CST_ (__ATOMIC_SEQ_CST|__ATOMIC_HLE_RELEASE)
static bool has_clflush=false;
static
void init_has_clflush(void)
{ has_clflush = ia64_has_clflush();
}
static
void init_has_clflush(void) __attribute__((constructor));
static inline __attribute__((always_inline))
void mutex_lock( register _Atomic int *ua )
{ // the SDM states that memory to be used as semaphores
// should not be in the WB cache memory, but nearest we
// can get to uncached memory is to explicitly un-cache it:
if(has_clflush)
asm volatile
( "CLFLUSHOPT (%0)"
:: "r" (ua)
);
// why isn't the cache flush enough?
else
asm volatile
( "LFENCE" :: );
register unsigned int x;
x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_);
_mm_pause();
if(has_clflush)
asm volatile
( "CLFLUSHOPT (%0)"
:: "r" (ua)
);
else
asm volatile
( "SFENCE" :: );
while((x = __atomic_load_n(ua,_LD_SEQ_CST_)) != 0)
switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
{case 0:
break;
case -1:
switch( errno )
{ case EINTR:
case EAGAIN:
continue;
default:
fprintf(stderr,"Unexpected futex error: %d : '%s'.", errno,
strerror(errno));
return;
}
}
}
static inline __attribute__((always_inline))
void mutex_unlock( register _Atomic int *ua )
{ if(has_clflush)
asm volatile
( "CLFLUSHOPT (%0)"
:: "r" (ua)
);
else
asm volatile( "LFENCE" :: );
register unsigned int x;
x = __atomic_add_fetch( ua, 1, _REL_SEQ_CST_);
_mm_pause();
if(has_clflush)
asm volatile
( "CLFLUSHOPT (%0)"
:: "r" (ua)
);
else
asm volatile ( "SFENCE" :: );
if(x == 0)
while( (1 < syscall( SYS_futex, ua, FUTEX_WAKE, 1,
nullptr,nullptr,0)) && (errno == EINTR));
}
Now, what is interesting is that the critical mutex_lock() subtraction and mutex_unlock() addition operations end up as the instructions:
mutex_lock:
# 61 "intel_lock1.c" 1
CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
.L7:
lock xacquire subl $1, lck(%rip)
rep nop
cmpb $0, has_clflush(%rip)
je .L8
#APP
# 72 "intel_lock1.c" 1
CLFLUSHOPT (%rbx)
# 0 "" 2
mutex_unlock:
#APP
# 98 "intel_lock1.c" 1
CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
.L24:
movl $1, %eax
lock xacquire xaddl %eax, lck(%rip)
rep nop
addl $1, %eax
cmpb $0, has_clflush(%rip)
je .L25
#APP
# 109 "intel_lock1.c" 1
CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
But this implementation seems to require the LFENCE / SFENCE to function reliably (CLFLUSHOPT is not enough); otherwise both threads can end up deadlocked in futex() with the lock value being an identical -1.
I cannot see from reading the Intel documentation how it can happen that two threads entering the instruction sequence:
# %rbx == $lck
CLFLUSHOPT (%rbx)
lock xacquire subl $1, lck(%rip)
rep nop
can both end up with the result '-1' in *lck if *lck was 0; surely one thread MUST get -1 and the other -2?
But strace says not:
strace: Process 11978 attached with 2 threads
[pid 11979] futex(0x60209c, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 11978] futex(0x60209c, FUTEX_WAIT, 4294967295, NULL^C
This is the deadlock situation. Where did I go wrong?
Please can any Intel CPU locking & caching experts out there explain how two atomic decrements or increments of the same uncached location *lck, both of which assert the #LOCK bus signal (exclusive bus access) and XACQUIRE, can end up getting the same result in *lck?

I thought that was what the #LOCK prefix (and HLE) was meant to prevent? I have tried NOT using HLE and just __ATOMIC_SEQ_CST for all accesses (this just adds the LOCK prefix, not XACQUIRE), but it makes no difference - deadlock still results without the {L,S}FENCEs.
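One way to narrow this down (a hedged sketch, not taken from the posted sources; the helper name is mine) is to record the value returned by the atomic decrement itself, rather than a later load of *lck; the returned value comes from the locked read-modify-write and cannot be the same in two threads that both started from 0:

#include <pthread.h>
#include <stdio.h>

/* Sketch: log what the locked RMW itself returned in each thread.
 * If two threads both decrement *ua starting from 0, one must see -1 and
 * the other -2 here; a separate later load of *ua can easily show -1 to
 * both, because the first locker may have locked and unlocked in between. */
static int probe_lock_value( _Atomic int *ua )
{ int x = __atomic_sub_fetch( ua, 1, __ATOMIC_SEQ_CST );
  fprintf( stderr, "tid %#lx: sub_fetch returned %d\n",
           (unsigned long)pthread_self(), x );
  return x;
}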
I have read Ulrich Drepper's excellent paper "Futexes are Tricky" (http://www.akkadia.org/drepper/futex.pdf), but he presents a mutex implementation that only writes hard-coded constants to the lock memory. I can see why. It is very hard to get a mutex to work reliably with a waiter count or any kind of arithmetic done on the lock value. Has anyone found ways to do reliable locked arithmetic such that the result is suitable for a lock / semaphore value on x86_64 Linux? Most interested in discussing them ...
So, after a few blind alleys investigating HLE & CLFLUSH, the ONLY working version of the lock / unlock I've been able to arrive at uses hard-coded constants and __atomic_compare_exchange_n - the full source of the test program, which increments a counter (without locking) until an exit signal (Ctrl+C) is received, is at:
Working Example: intel_lock3.c : https://drive.google.com/open?id=1ElB0qmwcDMxy9NBYkSXVxljj5djITYxa
enum LockStatus
{ LOCKED_ONE_WAITER = -1
, LOCKED_NO_WAITERS = 0
, UNLOCKED=1
};
static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic int *ua )
{ register int x;
int cx;
lock_superceded:
x = __atomic_load_n( ua, _LD_SEQ_CST_ );
cx = x;
x = (x == UNLOCKED)
? LOCKED_NO_WAITERS
: LOCKED_ONE_WAITER;
if (! __atomic_compare_exchange_n
( ua, &cx, x, false, _ACQ_SEQ_CST_, _ACQ_SEQ_CST_) )
goto lock_superceded;
if( x == LOCKED_ONE_WAITER )
{ do{
switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
{case 0:
break;
case -1:
switch( errno )
{ case EINTR:
return false;
case EAGAIN:
break;
default:
fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
errno, strerror(errno));
return false;
}
}
x = __atomic_load_n(ua,_LD_SEQ_CST_);
} while(x < 0);
}
return true;
}
static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic int *ua )
{ register int x;
int cx;
unlock_superceded:
x = __atomic_load_n( ua, _LD_SEQ_CST_ );
cx = x;
x = (x == LOCKED_ONE_WAITER)
? LOCKED_NO_WAITERS
: UNLOCKED;
if (! __atomic_compare_exchange_n
( ua, &cx, x, false, _ACQ_SEQ_CST_, _ACQ_SEQ_CST_) )
goto unlock_superceded;
if(x == LOCKED_NO_WAITERS)
{ while((1 <
syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0))
||( UNLOCKED != __atomic_load_n( ua, _LD_SEQ_CST_ ))
) // we were a waiter, so wait for locker to unlock !
{ if( errno != 0 )
switch(errno)
{case EINTR:
return false;
case EAGAIN:
break;
default:
fprintf(stderr,
"Unexpected futex WAKE error: %d : '%s'.",
errno, strerror(errno));
return false;
}
}
}
return true;
}
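For context, a minimal usage sketch of the kind of counter loop the test program runs (illustrative names only - worker, counter, done and lck below are not taken from intel_lock3.c):

#include <pthread.h>
#include <signal.h>

static _Atomic int lck = UNLOCKED;            /* shared lock word   */
static volatile sig_atomic_t done = 0;        /* set by ^C handler  */
static unsigned long counter = 0;             /* protected by lck   */

static void *worker( void *arg )
{ (void)arg;
  while( !done )
  { if( !mutex_lock( &lck ) )                 /* false: interrupted */
        break;
    counter += 1;                             /* critical section   */
    if( !mutex_unlock( &lck ) )
        break;
  }
  return NULL;
}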
Build & Test (GCC 7.3.1 & 6.4.1 & 5.4.0) used:
$ gcc -std=gnu11 -march=x86-64 -mtune=native -D_REENTRANT \
-pthread -Wall -Wextra -O3 -o intel_lock3 intel_lock3.c
$ ./intel_lock3
# wait a couple of seconds and press ^C
^C59362558
Broken Version Using Arithmetic:
https://drive.google.com/open?id=10yLrohdKLZT4p3G1icFHdjF5eHY68Yws
Compile with e.g.:
$ gcc -std=gnu11 -march=x86-64 -mtune=native -O3 -Wall -Wextra \
    -o intel_lock2 intel_lock2.c
$ ./intel_lock2
# wait a couple of seconds and press ^C
^Cwas locked!
446
It should not be printing "was locked!", and within a couple of seconds it should have exceeded a count, printed at the end, of about 5e8 (5x10^8), not 446.
Running with strace shows that two threads are blocking, waiting for the lock value of -1 to become 0:
$ strace -f -e trace=futex ./intel_lock2
strace: Process 14481 attached
[pid 14480] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAKE, 1 <unfinished ...>
[pid 14480] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
unavailable)
[pid 14481] <... futex resumed> ) = 0
[pid 14480] futex(0x602098, FUTEX_WAKE, 1 <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14480] <... futex resumed> ) = 0
[pid 14481] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
unavailable)
[pid 14480] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAIT, 4294967295, NULL^C <unfinished
...>
[pid 14480] <... futex resumed> ) = ? ERESTARTSYS (To be restarted
if SA_RESTART is set)
strace: Process 14480 detached
strace: Process 14481 detached
was locked!
7086
$
Normally, the WAIT should be scheduled before the WAKE, but somehow GCC is interpreting the memory ordering semantics to mean that the WAKE is always getting scheduled before any WAIT; but even if that happens, the code should just get delayed, and should never end up with two threads getting a -1 lck value on entry to futex(...FUTEX_WAIT...).
The almost identical algorithm using arithmetic on the lock value ALWAYS deadlocks when both threads get (-1,-1) - note, a -2 value is never seen by any thread:
static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic volatile int *ua )
{ register int x;
x = __atomic_add_fetch( ua, -1, _ACQ_SEQ_);
if( x < 0 )
{ do{
// here you can put:
// if( x == -2) { .. NEVER REACHED! }
switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
{case 0:
break;
case -1:
switch( errno )
{ case EINTR:
return false; // interrupted - user wants to exit?
case EAGAIN:
break;
default:
fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
errno, strerror(errno));
return false;
}
}
x = __atomic_load_n(ua,_LD_SEQ_);
} while(x < 0);
}
return true;
}
static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic volatile int *ua )
{ register int x;
x = __atomic_add_fetch( ua, 1, _REL_SEQ_);
if(x == 0) // there was ONE waiter
while( (1 <
syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0)
)
||(1 < __atomic_load_n(ua, _LD_SEQ_)
) // wait for first locker to unlock
)
{ if( errno != 0 )
switch(errno)
{case EINTR:
return false;
case EAGAIN:
break;
default:
fprintf(stderr,"Unexpected futex WAKE error: %d : '%s'.",
errno, strerror(errno));
return false;
}
}
return true;
}
So, I think that if the arithmetic operations were working as expected, i.e. were serialized and atomic, then the above code would not deadlock; the arithmetic should be generating the same numbers as the LockStatus enum values used in the working example.
But something is going wrong with the arithmetic, which now produces the instructions :
mutex_lock:
movl $-1, %eax
lock xaddl %eax, (%rdx)
mutex_unlock:
movl $1, %eax
lock xaddl %eax, (%rdx)
The code itself inserts no fences, but each __atomic_store_n(ua,...) generates one.
AFAICS, there is no valid schedule of that code that results in both threads getting the same -1 value.
So my conclusion is that use of the Intel LOCK prefix on arithmetic instructions is unsafe, and introduces buggy behaviour in user-mode Linux x86_64 gcc-compiled programs - only writes of constant values from text memory to data memory are atomic and sequentially ordered on Intel Haswell i7-4790M platforms with gcc & Linux, and arithmetic on such platforms cannot be made to be atomic & sequentially ordered by use of any combination of HLE / XACQUIRE, LOCK prefix, or FENCE instructions.
My hunch is that branch prediction is somehow failing and adding an extra arithmetic operation / failing to perform an arithmetic operation on this platform with the LOCK prefix asserted and multiple threads on different physical cores. Therefore, all arithmetic operations with the LOCK prefix asserted are suspect and should be avoided.
The latest example intel_lock2.c program at https://drive.google.com/open?id=10yLrohdKLZT4p3G1icFHdjF5eHY68Yws now works as well as the latest intel_lock3.c program at https://drive.google.com/open?id=1ElB0qmwcDMxy9NBYkSXVxljj5djITYxa, and there is now a version that keeps an accurate negative waiter count, and which uses locked arithmetic, at:
intel_lock4.c: https://drive.google.com/open?id=1kNOppMtobNHU0lfkfWTh8auXvRcbZfhO
The mutex_unlock() routine, IFF there are waiters, must wait for each existing waiter to unlock, so that when it returns, the mutex is unlocked and there are no waiters. It can either achieve this through spin-locking + sched_yield(), waiting for the lock value to become 1, or it can use another futex call. So the original locker, when it enters mutex_unlock(), becomes responsible for ensuring that every existing waiter wakes up and unlocks the mutex; a sketch of the spin + sched_yield() variant follows.
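A minimal sketch of that waiting step (illustrative only - the helper name and the assumption that UNLOCKED (1) marks the fully-released state are mine, not taken from intel_lock4.c):

#include <sched.h>

/* Sketch: after waking a waiter, spin (politely) until the lock word reads
 * UNLOCKED again, yielding the CPU so the woken waiter can actually run. */
static inline void wait_for_waiters( _Atomic int *ua )
{ while( __atomic_load_n( ua, __ATOMIC_SEQ_CST ) != UNLOCKED )
      sched_yield();
}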
Previously this answer contained:
See discussion below. In order for GCC's builtin __atomic* functions to work, GCC's optimization phases must be invoked; ANY -O$x flag specified during compilation suffices to enable correct operation of the __atomic* builtins.

Final best version of the mutex_lock() / unlock routines:

Testing ('^C' means pressing the Ctrl and C keys simultaneously):

Now all versions never deadlock and do work with:

I was trying to strace a '-g' (only) compiled version and got an inconsistency - this does not happen if ANY '-O' flag is also used.
lock subl $1, (%rdi) or lock xaddl %eax, (%rdx) are both 100% atomic in all cases, even if the pointer is misaligned (but much slower in that case), and are full memory barriers. On cacheable memory, there won't be any external #LOCK bus signal; the internal implementation just locks the cache line in M (Modified) state of MESI inside a core that's running the locked instruction. See "Can num++ be atomic for 'int num'?" for more details.

If your test is finding it isn't atomic, your hardware is broken or your test is broken. Finding a deadlock tells you there's a bug in your design, not that your atomic primitive building-blocks aren't atomic. You can very easily test atomic increments by using two threads to increment a shared counter, and notice that no counts are lost. Unlike if you used addl $1, shared(%rip) without lock, where you would see lost counts.

Also, lfence, sfence, and pause have no effect on correctness in the normal case (no NT stores, and using only WB (Write-Back) memory). If any of your fence / clflush stuff is helping, it's only by adding an extra delay somewhere that is maybe making that thread always lose a race in your test, not actually making it safe. mfence is the only fence that matters, blocking StoreLoad reordering and store-forwarding effects (which is why gcc uses it as part of implementing a seq-cst store).

Get a basic version working right before you even think about messing around with HLE / transactional memory.
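A minimal sketch of such a two-thread counter test (assumed names, not from the question's sources): with the locked add the final total is always NTHREADS * NITERS; replace it with a plain non-atomic increment and counts get lost.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define NITERS   10000000L

static long shared_counter = 0;   /* plain long; atomicity comes from the builtin */

static void *bump( void *arg )
{ (void)arg;
  for( long i = 0; i < NITERS; i++ )
      __atomic_add_fetch( &shared_counter, 1, __ATOMIC_SEQ_CST );  /* lock add */
  return NULL;
}

int main( void )
{ pthread_t t[NTHREADS];
  for( int i = 0; i < NTHREADS; i++ )
      pthread_create( &t[i], NULL, bump, NULL );
  for( int i = 0; i < NTHREADS; i++ )
      pthread_join( t[i], NULL );
  printf( "%ld (expected %ld)\n", shared_counter, (long)NTHREADS * NITERS );
  return 0;
}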
Race condition in your first version of acquiring the lock
x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_); is atomic, and only one thread's lock sub can change ua from 0 to -1 and get x = -1 from there.

But you aren't using the sub_fetch result; you're doing another load with while((x = __atomic_load_n(ua,_LD_SEQ_CST_)) != 0).

So another thread can see ua = -1 if the first thread locks and then unlocks between the lock sub and the load in that 2nd thread.

The reason it's called sub_fetch is that it atomically returns the new value, as well as atomically modifying the value in memory. The fact that you discard the sub_fetch result is why it can compile to lock sub at all, instead of lock xadd with a register holding -1. (Or a smart compiler could compile it to lock sub and check ZF, because you can tell when the value became non-zero or negative from flags set by lock sub.)

See "C & low-level semaphore implementation" for a simple semaphore with no fallback to OS-assisted sleep/wake. It spins on a load until we see a value greater than 0, then attempts to take the lock with C11 fetch_add(-1). But if it loses the race to another thread, it undoes the decrement.

This is probably a poor design; it's probably best to attempt the decrement with a lock cmpxchg, so threads that fail won't have to undo their decrement.

I haven't used HLE, but I assume this bug is what breaks your HLE locking as well.
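A sketch of that lock cmpxchg approach for a counting semaphore (the helper name sem_try_down is illustrative, not from the sources): only decrement while the current count is still positive, so a thread that loses the race just retries and never has to undo anything.

#include <stdbool.h>

/* Sketch: take one unit of a counting semaphore with a CAS loop.
 * The count is only decremented while it is observed to be > 0, so the value
 * never goes negative and a losing thread simply retries. */
static bool sem_try_down( _Atomic int *sem )
{ int old = __atomic_load_n( sem, __ATOMIC_RELAXED );
  while( old > 0 )
  { if( __atomic_compare_exchange_n( sem, &old, old - 1,
                                     true,   /* weak CAS: may spuriously fail */
                                     __ATOMIC_ACQUIRE, __ATOMIC_RELAXED ) )
        return true;         /* we decremented old -> old-1 */
    /* on failure, 'old' has been reloaded with the current value */
  }
  return false;              /* count was 0: nothing to take */
}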
You don't need SFENCE, LFENCE, CLFLUSH[OPT], or anything else. lock xadd is already a full memory barrier and 100% atomic on its own, on any memory type (including WB).

You probably misread the SDM if you thought it said you should avoid WB memory for mutexes / semaphores.
You also have a race window during wakeup that can lead to deadlock
This code in mutex_lock looks broken / race-prone. Given thread A sleeping in futex with lck == -1 (if that's possible?):

- another thread unlocks, leaving lck == 0, and calls futex(FUTEX_WAKE)
- thread A's futex call returns; lck is still 0
- another thread enters mutex_lock and runs __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_), leaving lck == -1
- thread A runs x = __atomic_load_n(ua,_LD_SEQ_CST_); at the bottom of its loop and sees -1
Now you have 2 threads stuck in the futex wait loop, and no thread actually got the mutex / entered the critical section.
I think your design is broken if it depends on doing a load after futex returns
The example in the futex(2) man page of fwait() shows it returning after futex returns, without loading again.

futex() is an atomic compare-and-block operation. Your design changes your counter value to -1 if one thread is waiting for the lock while a third thread tries to acquire it. So possibly your design is ok for 2 threads, but not for 3.

It's probably a good idea to use an atomic CAS for the decrement, so you never actually change lck to -1 or lower, and futex can stay blocked.

Then, if you can count on it to only ever wake 1, you can also trust its return value to mean you really have the lock, without the race-prone separate load. I think.
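A sketch of that compare-and-block idea, loosely modelled on the fwait() example in the futex(2) man page (the name fwait_sketch and the "1 = available / 0 = taken" convention are illustrative; error handling abbreviated):

#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: block until the futex word can be atomically claimed (1 -> 0).
 * There is no separate load after futex() returns: we just retry the CAS. */
static void fwait_sketch( _Atomic int *futexp )
{ for(;;)
  { int expected = 1;
    if( __atomic_compare_exchange_n( futexp, &expected, 0, false,
                                     __ATOMIC_ACQUIRE, __ATOMIC_RELAXED ) )
        return;                       /* claimed the token: we hold it */
    /* Word is 0: FUTEX_WAIT re-checks that *futexp is still 0 before
     * sleeping - the "atomic compare-and-block" described above. */
    long s = syscall( SYS_futex, futexp, FUTEX_WAIT, 0, NULL, NULL, 0 );
    if( s == -1 && errno != EAGAIN && errno != EINTR )
        return;                       /* unexpected error: give up (sketch) */
  }
}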