Intel-64 and ia32 atomic operations acquire-releas

2019-07-20 08:59发布

I am investigating Intel CPU atomic features on my Haswell CPU (a 4/8 core 2.3-3.9ghz i7-4790M), and am finding it really hard to construct eg. reliable mutex_lock() and mutex_unlock() operations as suggested by for instance the GCC manual:

6.53 x86-Specific Memory Model Extensions for Transactional Memory

The x86 architecture supports additional memory ordering flags to mark lock critical sections for hardware lock elision. These must be specified in addition to an existing memory model to atomic intrinsics.

 '__ATOMIC_HLE_ACQUIRE'
 Start lock elision on a lock variable.  Memory model must be
 '__ATOMIC_ACQUIRE' or stronger.
 '__ATOMIC_HLE_RELEASE'
 End lock elision on a lock variable.  Memory model must be
 '__ATOMIC_RELEASE' or stronger.

When a lock acquire fails it is required for good performance to abort the transaction quickly. This can be done with a '_mm_pause'

 #include <immintrin.h> // For _mm_pause

 int lockvar;

 /* Acquire lock with lock elision */
 while (__atomic_exchange_n(&lockvar, 1, 
     __ATOMIC_ACQUIRE|__ATOMIC_HLE_ACQUIRE))
     _mm_pause(); /* Abort failed transaction */
 ...
 /* Free lock with lock elision */
 __atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE|__ATOMIC_HLE_RELEASE);

So, reading that and the Intel Software Developer's Manual Vol.3 section 8.1, "Locked Atomic Operations", particulary section 8.1.4, "Effects of a LOCK Operation on Internal Processor Caches", led me to implement my test mutex_lock() mutex_unlock() at first like:

... static inline attribute((always_inline,const)) bool ia64_has_clflush(void) { register unsigned int ebx=0; asm volatile ( "MOV $7, %%eax\n\t" "MOV $0, %%ecx\n\t" "CPUID\n\t" "MOV %%ebx, %0\n\t" : "=r" (ebx) : : "%eax", "%ecx", "%ebx" ); return ((ebx & (1U<<23)) ? true : false); }

#define _LD_SEQ_CST_ __ATOMIC_SEQ_CST
#define _ST_SEQ_CST_ __ATOMIC_SEQ_CST
#define _ACQ_SEQ_CST_ (__ATOMIC_SEQ_CST|__ATOMIC_HLE_ACQUIRE)
#define _REL_SEQ_CST_ (__ATOMIC_SEQ_CST|__ATOMIC_HLE_RELEASE)

static bool has_clflush=false;
static
void init_has_clflush(void)
{ has_clflush = ia64_has_clflush();
}
static
void init_has_clflush(void) __attribute__((constructor));

static inline __attribute__((always_inline))
void mutex_lock( register _Atomic int *ua )
{ // the SDM states that memory to be used as semaphores
  // should not be in the WB cache memory, but nearest we
  // can get to uncached memory is to explicitly un-cache it:
  if(has_clflush)
    asm volatile
    ( "CLFLUSHOPT (%0)"
      :: "r" (ua)
    );
    // why isn't the cache flush enough?
    else
      asm volatile
      ( "LFENCE" :: );
      register unsigned int x;
      x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_);
      _mm_pause();
    if(has_clflush)
      asm volatile
      ( "CLFLUSHOPT (%0)"
       :: "r" (ua)
      );
    else
      asm volatile
      ( "SFENCE" :: );
  while((x = __atomic_load_n(ua,_LD_SEQ_CST_)) != 0)
    switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
    {case 0:
      break;
     case -1:
      switch( errno )
      { case EINTR:
        case EAGAIN:
         continue;
        default:
         fprintf(stderr,"Unexpected futex error: %d : '%s'.", errno,   
              strerror(errno));
        return;
      }
    }
  }

  static inline __attribute__((always_inline))
  void mutex_unlock( register _Atomic int *ua )
  { if(has_clflush)
      asm volatile
      ( "CLFLUSHOPT (%0)"
      :: "r" (ua)
      );
    else
      asm volatile( "LFENCE" :: );
    register unsigned int x;
    x = __atomic_add_fetch( ua, 1, _REL_SEQ_CST_);
    _mm_pause();
    if(has_clflush)
      asm volatile
      ( "CLFLUSHOPT (%0)"
        :: "r" (ua)
      );
    else
      asm volatile ( "SFENCE" :: );
    if(x == 0)
      while( (1 < syscall( SYS_futex, ua, FUTEX_WAKE, 1,
           nullptr,nullptr,0)) && (errno == EINTR));
  }

Now, what is interesting is that the critical mutex_lock() subtraction and mutex_unlock() addition operations end up as the instructions:

mutex_lock:

# 61 "intel_lock1.c" 1
    CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
.L7:
    lock xacquire subl  $1, lck(%rip)
    rep nop
    cmpb    $0, has_clflush(%rip)
    je  .L8
#APP
# 72 "intel_lock1.c" 1
    CLFLUSHOPT (%rbx)
# 0 "" 2

mutex_unlock:

#APP
# 98 "intel_lock1.c" 1
    CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
.L24:
    movl    $1, %eax
    lock xacquire xaddl %eax, lck(%rip)
    rep nop
    addl    $1, %eax
    cmpb    $0, has_clflush(%rip)
    je  .L25
#APP
# 109 "intel_lock1.c" 1
    CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP

But this implementation seems to require the LFENCE / SFENCE to function reliably (CLFLUSHOPT is not enough) , otherwise both threads can end up deadlocked in futex() with the lock value being an identical -1 .

I cannot see from reading the intel documentation how it can happen that two threads entering the instruction sequence :

# %rbx == $lck
CLFLUSHOPT (%rbx)
lock xacquire subl  $1, lck(%rip)
rep nop

can both end up with the result '-1' in *lck if *lck was 0 ; surely one thread MUST get -1 and the other -2 ?

But strace says not:

strace: Process 11978 attached with 2 threads
[pid 11979] futex(0x60209c, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 11978] futex(0x60209c, FUTEX_WAIT, 4294967295, NULL^C

this is the deadlock situation. Where did I go wrong ?

Please can any Intel CPU Locking & Caching experts out there explain how two atomic decrements or increments of the same uncached location *lck that both assert the #LOCK bus signal (exclusive bus access) and XACQUIRE can end up getting the same result in *lck?

I thought that was what the #LOCK prefix (and HLE) was meant to prevent ? I have tried NOT using HLE and just __ATOMIC_SEQ_CST for all accesses, (this just adds the LOCK prefix, not XACQUIRE) but it makes no difference - deadlock still results without the {L,S}FENCE-es.

I have read Ulrich Drepper's excellent paper [ Futexes are Tricky ] :http://www.akkadia.org/drepper/futex.pdf , but he presents a mutex implementation that only writes hard-coded constants to the lock memory . I can see why . It is very hard to get a mutex to work reliably with a waiter count or any kind of arithmetic done on the lock value. Has anyone found ways to do reliable locked arithmetic such that the result is suitable for lock / semaphore value on x86_64 Linux ? Most interested in discussing them ...

So after a few blind alleys investigating HLE & CLFLUSH, the ONLY working version of the lock / unlock I've been able to arrive at uses hard coded constants and __atomic_compare_exchange_n - the full source of the test program, which increments a counter (without locking) until + / an exit signal is received, is at:

Working Example: intel_lock3.c

[]:https://drive.google.com/open?id=1ElB0qmwcDMxy9NBYkSXVxljj5djITYxa

enum LockStatus
{ LOCKED_ONE_WAITER = -1
, LOCKED_NO_WAITERS = 0
, UNLOCKED=1
};

static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic int *ua )
{ register int x;
  int cx;
 lock_superceded:
  x  = __atomic_load_n( ua, _LD_SEQ_CST_ );
  cx = x;
  x = (x == UNLOCKED)
       ? LOCKED_NO_WAITERS
       : LOCKED_ONE_WAITER;
  if (! __atomic_compare_exchange_n
      ( ua, &cx, x, false, _ACQ_SEQ_CST_,  _ACQ_SEQ_CST_) )
    goto lock_superceded;
  if( x == LOCKED_ONE_WAITER )
  { do{
    switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
    {case 0:
      break;
     case -1:
      switch( errno )
      { case EINTR:
         return false;
        case EAGAIN:
          break;
        default:
          fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
                  errno, strerror(errno));
          return false;
       }
    }
    x = __atomic_load_n(ua,_LD_SEQ_CST_);
    } while(x < 0);
  }
  return true;
}

static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic int *ua )
{ register int x;
  int cx;
 unlock_superceded:
  x  = __atomic_load_n( ua, _LD_SEQ_CST_ );
  cx = x;
  x = (x == LOCKED_ONE_WAITER)
       ? LOCKED_NO_WAITERS
       : UNLOCKED;
  if (! __atomic_compare_exchange_n
       ( ua, &cx, x, false, _ACQ_SEQ_CST_,  _ACQ_SEQ_CST_) )
    goto unlock_superceded;
    if(x == LOCKED_NO_WAITERS)
    { while((1 < 
             syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0))
         ||( UNLOCKED != __atomic_load_n( ua, _LD_SEQ_CST_ ))
         ) // we were a waiter, so wait for locker to unlock !
      { if( errno != 0 )
          switch(errno)
          {case EINTR:
            return false;
           case EAGAIN:
            break;
           default:
            fprintf(stderr,
                  "Unexpected futex WAKE error: %d : '%s'.", 
                  errno, strerror(errno));
            return false;
          }
      }
   }
   return true;
 }

 Build & Test (GCC 7.3.1 & 6.4.1 & 5.4.0) used:
 $ gcc -std=gnu11 -march=x86-64 -mtune=native -D_REENTRANT \
   -pthread -Wall -Wextra -O3 -o intel_lock3 intel_lock3.c

 $ ./intel_lock3
 # wait a couple of seconds and press ^C
 ^C59362558

Broken Version Using Arithmetic:

https://drive.google.com/open?id=10yLrohdKLZT4p3G1icFHdjF5eHY68Yws

Compile with eg:

$ gcc -std=gnu11 -march=x86_64 -mtune=native -O3 -Wall -Wextra 
  -o intel_lock2 intel_lock2.c
$ ./intel_lock2
# wait a couple of seconds and press ^C
$ ./intel_lock2
^Cwas locked!
446

It should not be printing "was locked!" and within a couple of seconds should have exceeded a count, printed at the end, of @ 5e8 : 5x10^8 , not 446.

Running with strace shows that two threads are blocking waiting for the lock value of -1 to become 0 :

$ strace -f -e trace=futex ./intel_lock2
strace: Process 14481 attached
[pid 14480] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAKE, 1 <unfinished ...>
[pid 14480] <... futex resumed> )       = -1 EAGAIN (Resource temporarily
                                          unavailable)
[pid 14481] <... futex resumed> )       = 0
[pid 14480] futex(0x602098, FUTEX_WAKE, 1 <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14480] <... futex resumed> )       = 0
[pid 14481] <... futex resumed> )       = -1 EAGAIN (Resource temporarily
                                          unavailable)
[pid 14480] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAIT, 4294967295, NULL^C <unfinished  
...>
[pid 14480] <... futex resumed> )       = ? ERESTARTSYS (To be restarted 
if SA_RESTART is set)
strace: Process 14480 detached
strace: Process 14481 detached
was locked!
7086

$

Normally, the WAIT should be scheduled before the WAKE, but somehow GCC is interpreting the memory ordering semantics to mean that the WAKE is always getting scheduled before any WAIT ; but even if that happens, the code should just get delayed, and should never end up with two threads getting a -1 lck value on entry to futex(...FUTEX_WAIT..).

The almost identical algorithm using arithmetic on the lock value ALWAYS deadlocks when both threads get (-1,-1) - note, a -2 value is never seen by any thread:

static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic volatile int *ua )
{ register int x;
  x = __atomic_add_fetch( ua, -1, _ACQ_SEQ_);
  if( x < 0 )
  { do{
    // here you can put:
    // if( x == -2) { .. NEVER REACHED! }
    switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
    {case 0:
      break;
     case -1:
      switch( errno )
      { case EINTR:
         return false; // interrupted - user wants to exit?
        case EAGAIN:
          break;
        default:
          fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
                  errno, strerror(errno));
          return false;
       }
    }
    x = __atomic_load_n(ua,_LD_SEQ_);
    } while(x < 0);
  }
  return true;
}

static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic volatile int *ua )
{ register int x;
  x = __atomic_add_fetch( ua, 1, _REL_SEQ_);
  if(x == 0) // there was ONE waiter
     while(  (1 < 
             syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0)
             )
           ||(1 < __atomic_load_n(ua, _LD_SEQ_)
             ) // wait for first locker to unlock
           ) 
     { if( errno != 0 )
         switch(errno)
         {case EINTR:
           return false;
          case EAGAIN:
           break;
          default:
           fprintf(stderr,"Unexpected futex WAKE error: %d : '%s'.", 
                  errno, strerror(errno));
           return false;
         }
       }
     return true;
   }

So, I think if if the arithmetic operations were working as expected, ie. were serialized and atomic, then the above code would not deadlock; the arithmetic should be generating the same numbers as the LockStatus enum values used in the working example.

But something is going wrong with the arithmetic, which now produces the instructions :

mutex_lock:

movl    $-1, %eax
lock xaddl  %eax, (%rdx)

mutex_unlock:

movl    $1, %eax
lock xaddl  %eax, (%rdx)

The code itself inserts no fences, but each __atomic_store_n(ua,...) generates one .

AFAICS, there is no valid schedule of that code that results in both threads getting the same -1 value.

So my conclusion is that use of the intel LOCK prefix on arithmetic instructions is unsafe and introduces buggy behaviour in user-mode Linux x86_64 gcc compiled programs - only writes of constant values from text memory to data memory is atomic and sequentially ordered on Intel Haswell i7-4790M platforms with gcc & Linux, and arithmetic on such platforms cannot be made to be atomic & sequentially ordered by use of any combination of HLE / XACQUIRE, lock prefix, or FENCE instructions.

My hunch is that branch prediction is somehow failing and adding an extra arithmetic operation / failing to perform an arithmetic operation on this platform with the LOCK prefix asserted and multiple threads on different physical cores . Therefore, all arithmetic operations with the LOCK prefix asserted are suspect and should be avoided.

2条回答
混吃等死
2楼-- · 2019-07-20 09:07

The latest example intel_lock2.c program at

: https://drive.google.com/open?id=10yLrohdKLZT4p3G1icFHdjF5eHY68Yws

now works as well as the latest intel_lock3.c program at

: https://drive.google.com/open?id=1ElB0qmwcDMxy9NBYkSXVxljj5djITYxa

and there is now a version that keeps an accurate negative waiter count, and which uses locked arithmetic, at:

intel_lock4.c: https://drive.google.com/open?id=1kNOppMtobNHU0lfkfWTh8auXvRcbZfhO

The unlock_mutex() routine, IFF there are waiters, must wait for each existing waiter to unlock, so that when it returns, the mutex is unlocked and there are no waiters. It can either achieve this through spin-locking + sched_yield() waiting for the lock value to become 1, or it can use another futex call. So the original locker, when it enters mutex_unlock(), becomes responsible for ensuring that every existing waiter wakes up and unlocks the mutex.

Previously this answer contained:

But there is still weirdness : if either process is ptrace-ed() by strace or compiled with '-g3' instead of '-O3', it now experiences an 'Inconsistency' - ie. inconsistent critical section modified values. This does not occur if the program is not ptrace-d and compiled with -O3 .

See discussion below. In order for GCC's builtin __atomic* functions to work, GCC's optimization phases must be invoked, with ANY -O$x flag specified during compilation sufficing to enable correct operation of the __atomic* builtins.

Final best version of the mutex_lock() / unlock routines:

static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic volatile int *ua )
// lock the mutex value pointed to by 'ua';
// can return false if operation was interrupted ( a signal received ).
{ register int x;
  // lock_again:
  x = __atomic_add_fetch( ua, -1, _ACQ_SEQ_);
  while( x < 0 )
  { switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
    {case 0:
      break;
     case -1:
      switch( errno )
      { case EINTR:
         return false;
        case EAGAIN:
          break;
        default:
          // this has never been observed to happen, but in any 
          // production implementation
          // should be replaced by some kind of 
          // 'throw( exception )' statement:
          fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
                  errno, strerror(errno));
          return false;
       }
    }
    x = __atomic_load_n(ua,_LD_SEQ_);
  }
  return true;
}

static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic volatile int *ua )
// unlock: returns false only if interrupted, else returns true
// only when the mutex pointed to by *ua has been unlocked and 
// has no waiters.
{
#ifdef _WITH_UWAIT_
  static int has_unlock_waiter = 0;
#endif
  register int x;
  x = __atomic_add_fetch( ua, 1, _REL_SEQ_);
  if(x < 1) // there was at least ONE waiter, 
            // so we are the original locker
  { while(1 < syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0))
    { if( errno != 0 )
        switch(errno)
        {case EINTR:
          return false;
         case EAGAIN:
          break;
         default:
           // never observed to happen - should be a throw()
          fprintf(stderr,"Unexpected futex WAKE error: %d : '%s'.", 
                  errno, strerror(errno));
          return false;
        }
    }
#ifdef _WITH_UWAIT_
// this is strictly unnecessary, and can be replaced by use of
// sched_yield() (see below), but it
// makes the situation clearer:
// unlock :
    // so we have woken a waiter; wait for that waiter to 
    // actually unlock before returning -
    // by definition, when that waiter enters mutex_unlock() 
    // (AND IT MUST!!), it will not
    // enter the clause containing this code unless there is more than
    // one other waiter., in which case we want to continue until there
    // are no waiters.
    while(1 > (x = __atomic_load_n( ua, _LD_SEQ_ )))
    { __atomic_store_n(&has_unlock_waiter, 1, _ST_SEQ_);
      if( (-1 == 
          syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0)
          ) && (errno == EINTR)
        ) return false;
    }
    if( __atomic_load_n(&has_unlock_waiter, _ST_SEQ_) )
      __atomic_store_n(&has_unlock_waiter, 0, _ST_SEQ_);
#else
// The same result is actually achieved by this loop:
    while(1 > (x = __atomic_load_n(ua, _LD_SEQ_)))
      sched_yield();
#endif
    // we do need to wait for the waiting locker to unlock 
    // before proceeding, else
    // mutex_lock could be reentered with lck < 0 and deadlock 
    // would result.
#ifdef _WITH_UWAIT_
  }else if( (x==1) && __atomic_load_n(&has_unlock_waiter, _ST_SEQ_) )
  { // so we're the waiter that a previous unlock woke up 
    // and is waiting for - it now needs to be woken:
    while(1 < syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0))
    { if( errno != 0 )
        switch(errno)
        {case EINTR:  // no, we cannot let user try to unlock again, since modification of lock value succeeded.
         case EAGAIN:
          break;
         default:
          fprintf(stderr,"Unexpected futex WAKE error: %d : '%s'.", errno, strerror(errno));
          return false;
        }
    }
  }
#else
  }
#endif
  return true;
}

Testing:

$ gcc -std=gnu11 -pthread -D_WITH_UWAIT_ -O3 -o il2 il2.c
$ ./il2
^C20906015
$ gcc -std=gnu11 -pthread -O3 -o il2 il2.c
$ ./il2
^C45851541

('^C' means pressing + keys simultaneously).

Now all versions never deadlock and do work with :

$ strace -f -e trace=futex ./{intel_lock2 OR intel_lock3 OR intel_lock4} 

I was trying to strace a '-g' (only) compiled version and got an Inconsistency - this does not happen if ANY '-O' flag also used.

查看更多
乱世女痞
3楼-- · 2019-07-20 09:18

lock subl $1, (%rdi) or lock xaddl %eax, (%rdx) are both 100% atomic in all cases, even if the pointer is misaligned (but much slower in that case), and are full memory barriers. On cacheable memory, there won't be any external #LOCK bus signal; the internal implementation just locks the cache line in M state of MESI inside a core that's running the locked instruction. See Can num++ be atomic for 'int num'? for more details.

If your test is finding it isn't atomic, your hardware is broken or your test is broken. Finding a deadlock tells you there's a bug in your design, not that your atomic primitive building-blocks aren't atomic. You can very easily test atomic increments by using two threads to increment a shared counter, and notice that no counts are lost. Unlike if you used addl $1, shared(%rip) without lock, where you would see lost counts.

Also, lfence, sfence, and pause have no effect on correctness in the normal case (no NT stores, and using only WB (Write-Back) memory). If any of your fence / clflush stuff is helping, it's only by adding an extra delay somewhere that is maybe making that thread always lose a race in your test, not actually making it safe. mfence is the only fence that matters, blocking StoreLoad reordering and store-forwarding effects. (Which is why gcc uses it as part of implementing a seq-cst store).

Get a basic version working right before you even think about messing around with HLE / transactional memory.


Race condition in your first version of acquiring the lock

x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_); is atomic, and only one thread's lock sub can change ua from 0 to -1 and get x=-1 from there.

But you aren't using the sub_fetch result, you're doing another load with
while((x = __atomic_load_n(ua,_LD_SEQ_CST_)) != 0)

So another thread can see ua=-1 if the first thread locks and then unlocks between the lock sub and the load in that 2nd thread.

The reason it's called sub_fetch is that it atomically returns the old value, as well as atomically modifying the value in memory. The fact that you discard the sub_fetch result is why it can compile to lock sub at all, instead of lock xadd with a register holding -1.

(Or a smart compiler could compile it to lock sub and check ZF, because you can tell when the value became non-zero or negative from flags set by lock sub.)


See C & low-level semaphore implementation for a simple semaphore with no fallback to OS-assisted sleep/wake. It spins on a load until we see a value greater than 0, then attempts to take the lock with C11 fetch_add(-1).

But if it loses the race to another thread, it undoes the decrement.

This is probably a poor design; it's probably best to attempt the decrement with a lock cmpxchg, so threads that fail won't have to undo their decrement.


I haven't used HLE, but I assume this bug is what breaks your HLE locking as well.

You don't need SFENCE, LFENCE, or CLFLUSH[OPT] or anything. lock xadd is already a full memory barrier and 100% atomic on its own, on any memory type (including WB).

You probably misread the SDM if you thought it said you should avoid WB memory for mutexes / semaphores.


You also have a race window during wakeup that can lead to deadlock

This code in mutex_lock looks broken / race-prone

x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_);  // ok, fine
_mm_pause();   // you don't want a pause on the fast path.

if( x < 0 )   // just make this a while(x<0) loop
do {
   futex(..., FUTEX_WAIT, ...);

   x = __atomic_load_n(ua,_LD_SEQ_CST_);        // races with lock sub in other threads.
} while(x < 0);

Given thread A sleeping in futex with lck == -1 (if that's possible?):

  • thread B unlocks, resulting in lck == 0, and calls futex(FUTEX_WAKE)
  • thread A wakes up, futex returns while lck is still 0
  • some other thread (B or a 3rd thread) enters mutex_lock and runs __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_);, leaving lck == -1
  • thread A runs x = __atomic_load_n(ua,_LD_SEQ_CST_); at the bottom of its loop and sees -1

Now you have 2 threads stuck in the futex wait loop, and no thread actually got the mutex / entered the critical section.


I think your design is broken if it depends on doing a load after futex returns

The example in the futex(2) man page of fwait() shows it returning after futex returns, without loading again.

futex() is an atomic compare-and-block operation. Your design changes your counter value to -1 if one thread is waiting for the lock while a third thread tries to acquire it. So possibly your design is ok for 2 threads, but not for 3.

It's probably a good idea to use an atomic CAS for the decrement, so you never actually change lck to -1 or lower, and futex can stay blocked.

Then if you can count on it to only ever wake 1, then can you also trust its return value to mean you really have the lock without the race-prone separate load. I think.

查看更多
登录 后发表回答