Relative performance of swap vs compare-and-swap locks on x86

Published 2020-02-04 06:58

Question:

Two common locking idioms are:

if (!atomic_swap(lockaddr, 1)) /* got the lock */

and:

if (!atomic_compare_and_swap(lockaddr, 0, val)) /* got the lock */

where val could simply be a constant or an identifier for the new prospective owner of the lock.

What I'd like to know is whether there tends to be any significant performance difference between the two on x86 (and x86_64) machines. I know this is a fairly broad question since the answer might vary a lot between individual cpu models, but that's part of the reason I'm asking SO rather than just doing benchmarks on a few cpus I have access to.
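For concreteness, here is a sketch of the two idioms in C11 <stdatomic.h> terms (the function names are illustrative; note that C11's compare-exchange returns true on success rather than returning the old value):

#include <stdatomic.h>

static atomic_int lockvar;

/* Idiom 1: unconditional swap; an old value of 0 means we got the lock. */
static int try_lock_swap(void)
{
    return atomic_exchange(&lockvar, 1) == 0;
}

/* Idiom 2: compare-and-swap; writes val only if the lock was free (0). */
static int try_lock_cas(int val)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&lockvar, &expected, val);
}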

Answer 1:

I assume atomic_swap(lockaddr, 1) gets translated to an xchg reg,mem instruction and atomic_compare_and_swap(lockaddr, 0, val) gets translated to a cmpxchg[8b|16b] instruction.

Some Linux kernel developers think cmpxchg is faster, because the lock prefix isn't implied as it is with xchg. So if you are on a uniprocessor (where even a multithreaded program doesn't need the locked bus cycle, since a single instruction is atomic with respect to context switches), or can otherwise make sure the lock prefix isn't needed, you are probably better off with cmpxchg.

But chances are your compiler will translate it to a "lock cmpxchg" anyway, and in that case it doesn't really matter. Also note that while the latencies of these instructions are low (about 1 cycle without lock and about 20 with it), if you share a sync variable between two threads, which is quite usual, some additional bus cycles will be enforced, and these last forever compared with the instruction latencies. The instruction latencies will most likely be completely hidden by a cache snoop/sync/memory access/bus lock/whatever lasting 200 to 500 CPU cycles.



Answer 2:

I found this Intel document, stating that there is no difference in practice:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/

One common myth is that the lock utilizing a cmpxchg instruction is cheaper than a lock utilizing an xchg instruction. This is used because cmpxchg will not attempt to get the lock in exclusive mode since the cmp will go through first. Figure 9 shows that the cmpxchg is just as expensive as the xchg instruction.



Answer 3:

On x86, any instruction with a LOCK prefix does all memory operations as read-modify-write cycles. This means that XCHG (with its implicit LOCK) and LOCK CMPXCHG (in all cases, even if the comparison fails) always get an exclusive lock on the cache line. The result is that there is basically no difference in performance.

Note that many CPUs all spinning on the same lock can cause a lot of bus overhead in this model. This is one reason that spin-lock loops should contain PAUSE instructions. Some other architectures have better operations for this.
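A minimal sketch of such a spin loop in C, assuming an x86 target where _mm_pause() from <immintrin.h> emits the PAUSE instruction:

#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause(); x86/x86_64 only */

static atomic_int lockvar;

static void spin_lock(void)
{
    /* Each failed attempt backs off with PAUSE, which reduces the bus
       overhead and power draw mentioned above. */
    while (atomic_exchange(&lockvar, 1))
        _mm_pause();
}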



Answer 4:

Are you sure you didn't mean

 if (!atomic_load(lockaddr))
     if (!atomic_swap(lockaddr, val)) /* got the lock */

for the second one?

Test-and-test-and-set locks (see Wikipedia https://en.wikipedia.org/wiki/Test_and_test-and-set ) are quite a common optimization on many platforms.

Depending on how compare-and-exchange is implemented, it could be faster or slower than a test-and-test-and-set.

Since x86 is a relatively strongly ordered platform, the hardware optimizations that can make test-and-test-and-set locks faster may be less applicable.

Figure 8 from the document that Bo Persson found (http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/) shows that test-and-test-and-set locks are superior in performance; a minimal sketch of the technique follows.
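A sketch in C11 <stdatomic.h> terms (names are illustrative, not from the Intel document):

#include <stdatomic.h>

static atomic_int lockvar;

static void ttas_lock(void)
{
    for (;;) {
        /* "Test": spin with a plain load; this hits the local cache and
           generates no bus traffic while the lock is held elsewhere. */
        while (atomic_load_explicit(&lockvar, memory_order_relaxed))
            ;
        /* "Test-and-set": only now attempt the expensive locked RMW. */
        if (atomic_exchange(&lockvar, 1) == 0)
            return; /* old value was 0: lock acquired */
    }
}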



Answer 5:

In terms of performance on Intel processors, the two are the same, but for the sake of simplicity, to have things easier to fathom, I prefer the first of the examples that you have given. There is no reason to use cmpxchg to acquire a lock if you can do so with xchg.

According to Occam's razor, simple things are better.

Besides that, locking with xchg is more powerful: you can also check the correctness of the logic of your software, i.e. that you are not accessing a memory byte that has not been explicitly allocated for locking, and that you don't unlock twice.

There is no consensus on whether releasing a lock should be a plain store or a lock-ed store. For example, LeaveCriticalSection on Windows 10 uses a lock-ed store to release the lock even on a single-socket processor, while on multiple physical processors with Non-Uniform Memory Access (NUMA), the question of how to release the lock (a plain store vs a lock-ed store) may be even more important.
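Both release styles can be sketched in C11 terms (illustrative only; which is better is platform-dependent, as noted above):

#include <stdatomic.h>

static atomic_int lockvar;

/* Plain release: on x86 this compiles to an ordinary mov, because x86
   stores already have release semantics. */
static void unlock_plain_store(void)
{
    atomic_store_explicit(&lockvar, 0, memory_order_release);
}

/* Lock-ed release: a full read-modify-write (xchg on x86), the style
   LeaveCriticalSection uses on Windows 10 according to the above. */
static void unlock_locked_store(void)
{
    atomic_exchange(&lockvar, 0);
}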

See this example of safer locking functions that check data for validity and catch attempts to release locks that were not acquired:

const
  cLockAvailable = 107; // arbitrary constant, use any unique values that you like, I've chosen prime numbers
  cLockLocked    = 109;
  cLockFinished  = 113;

function AcquireLock(var Target: LONG): Boolean;
var
  R: LONG;
begin
  R := InterlockedExchange(Target, cLockLocked);
  case R of
    cLockAvailable:
      Result := True;  // the lock was available and we have just acquired it
    cLockLocked:
      Result := False; // the lock is already held by someone else; we failed to acquire it this time
  else
    raise Exception.Create('Serious application error - tried to acquire lock using a variable that has not been properly initialized');
  end;
end;

procedure ReleaseLock(var Target: LONG);
var
  R: LONG;
begin
  // As Peter Cordes pointed out (see comments below), releasing the lock doesn't have to be
  // interlocked; a plain store suffices, and even the debugging check uses a plain load.
  // Windows 10 nevertheless uses a lock-ed release in LeaveCriticalSection.
  R := Target;
  Target := cLockAvailable;
  if R <> cLockLocked then
    raise Exception.Create('Serious application error - tried to release a lock that has not actually been locked');
end;

Your main application goes here:

var
  AreaLocked: LONG;
begin
  AreaLocked := cLockAvailable; // on program initialization, fill the default value

  .... 

 if AcquireLock(AreaLocked) then
 try
   // do something critical with the locked area
   ... 

 finally
   ReleaseLock(AreaLocked); 
 end;

....

  AreaLocked := cLockFinished; // on program termination, set the special value to catch probable cases when somebody will try to acquire the lock

end.

You may also use the following code as a spin-loop. It uses a plain load while spinning to save resources, as suggested by Peter Cordes. After 5000 iterations it calls the Windows API function SwitchToThread(). The value of 5000 is empirical: values from 500 to 50000 also seem to be OK, and in some scenarios lower values are better while in others higher ones are.

Please note that you may only use this code on processors that support SSE2 - you should check the corresponding CPUID bit before executing the pause instruction - otherwise pause is just a waste of power. On processors without pause, use other means, like EnterCriticalSection/LeaveCriticalSection, or Sleep(0) and then Sleep(1) in a loop.

Some people say that on 64-bit processors you need not check for SSE2, because the original AMD64 architecture adopted Intel's SSE and SSE2 as core instructions; so, practically, if you run 64-bit code, you already have SSE2 and thus the pause instruction. However, Intel discourages relying on the presence of a specific feature and explicitly states that a feature may vanish in future processors and that applications must always check for features via CPUID. On the other hand, the SSE instructions have become ubiquitous and many 64-bit compilers use them without checking (e.g. Delphi for Win64), so the chances that some future processor will lack SSE2, let alone pause, are very slim.

// on entry rcx = address of the byte-lock
// on exit: al (eax) = old value of the byte at [rcx]
@Init:
   mov  edx, cLockLocked
   mov  r9d, 5000
   mov  eax, edx
   jmp  @FirstCompare
@DidntLock:
@NormalLoadLoop:
   dec  r9
   jz   @SwitchToThread // for static branch prediction, jump forward means "unlikely"
   pause
@FirstCompare:
   cmp  [rcx], al       // plain (non-interlocked) load while spinning, to avoid consuming bus resources; only when the lock looks free do we retry the interlocked exchange
   je   @NormalLoadLoop // for static branch prediction, jump backwards means "likely"
   lock xchg [rcx], al  // the lock prefix is implied for xchg with a memory operand; written out here only for clarity
   cmp  eax, edx        // 32-bit comparison is faster on newer processors like Xeon Phi or Cannonlake.
   je   @DidntLock
   jmp  @Finish
@SwitchToThread:
   push  rcx
   call  SwitchToThreadIfSupported
   pop   rcx
   jmp  @Init
@Finish:
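
Finally, the CPUID check for SSE2 mentioned above might look like this in C (a sketch assuming GCC/Clang's <cpuid.h>; the helper name has_sse2_pause is mine):

#include <cpuid.h>   /* __get_cpuid(); GCC/Clang on x86 */

/* SSE2 (and therefore a meaningful "pause") is reported in CPUID
   leaf 1, EDX bit 26. On CPUs without SSE2, pause decodes as a plain
   nop (rep nop), but the check keeps us honest, as Intel recommends. */
static int has_sse2_pause(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 26) & 1;
}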