I was thinking about how to implement counting (not binary) semaphores using as little asm code as possible.
I haven't managed to write one without using a mutex, so here's the best I could do so far:
Global:
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
typedef struct
{
    atomic_ullong value;
    pthread_mutex_t *lock_op;
    bool ready;
} semaphore_t;

typedef struct
{
    atomic_ullong value;
    pthread_mutex_t lock_op;
    bool ready;
} static_semaphore_t;
/* use with static_semaphore_t */
#define SEMAPHORE_INITIALIZER(value) {value, PTHREAD_MUTEX_INITIALIZER, true}
Functions:
bool semaphore_init(semaphore_t *semaphore, unsigned long long init_value)
{
    if(semaphore->ready)
        pthread_mutex_destroy(semaphore->lock_op);
    else if(!(semaphore->lock_op = calloc(1, sizeof(pthread_mutex_t))))
        return false;

    if(pthread_mutex_init(semaphore->lock_op, NULL))
        return false;

    semaphore->value = init_value;
    semaphore->ready = true;

    return true;
}
bool semaphore_wait(semaphore_t *semaphore)
{
    if(!semaphore->ready) return false;

    pthread_mutex_lock(semaphore->lock_op);
    while(!semaphore->value) __asm__ __volatile__("nop");
    (semaphore->value)--;
    pthread_mutex_unlock(semaphore->lock_op);

    return true;
}
bool semaphore_post(semaphore_t *semaphore)
{
    if(!semaphore->ready) return false;

    atomic_fetch_add(&(semaphore->value), (unsigned long long) 1);
    return true;
}
Is it possible to implement a semaphore in only a few lines, with the atomic built-ins or directly in assembly (e.g. lock cmpxchg)?
Looking at the sem_t struct from <bits/semaphore.h> (included by <semaphore.h>), it seems a very different path was chosen...
typedef union
{
    char __size[__SIZEOF_SEM_T];
    long int __align;
} sem_t;
UPDATE:
@PeterCordes has proposed a much better solution using atomics without a mutex, doing the checks directly on the semaphore value.
I still want to understand how to improve the code's performance by taking advantage of pause instructions or kernel calls that avoid wasting CPU while waiting for the critical resource to become available.
It would also be nice to have a standard implementation of mutexes and counting semaphores for comparison.
From futex(7) I read: "The Linux kernel provides futexes ("Fast user-space mutexes") as a building block for fast user-space locking and semaphores. Futexes are very basic and lend themselves well for building higher-level locking abstractions such as mutexes, condition variables, read-write locks, barriers, and semaphores."
See partway down for my minimal naive semaphore implementation, which probably works. It compiles and looks right for x86. I think it's correct in general for any C11 implementation.
IIRC, it's possible to implement a counting lock (aka semaphore) with just a single integer, which you access with atomic operations. That wikipedia link even gives algorithms for up/down. You don't need a separate mutex. If atomic_ullong needs a mutex to support atomic increment/decrement on the target CPU, it will include one. (This may be the case on 32-bit x86, or the implementation may use a slow cmpxchg8b instead of a fast lock xadd. Is a 32-bit counter really too small for your semaphore? Because 64-bit atomics will be slower on 32-bit machines.)

The <bits/semaphore.h> union definition is clearly just an opaque POD type with the correct size, not indicative of the actual implementation.

As @David Schwartz says, it's a fool's errand to implement your own locking for practical use unless you're an expert. It might be an interesting way to learn about atomic operations and find out what's under the hood in the standard implementations, though. Note carefully his caution that locking implementations are hard to test. You might write code that works for your test-case on your hardware, with code from the current version of your compiler with your chosen compile options...
The ready boolean is just a total waste of space. If you can correctly initialize the ready flag so that it's meaningful for functions to look at it, then you can initialize the other fields to a sane initial state, too.

Your code has a few other problems I noticed:

Using a dynamically-allocated pthread_mutex_t *lock_op is just silly. Store the mutex by value, not behind a pointer. Most of your locking functions use the mutex, so the extra level of indirection just slows things down. It would be much better for the memory to just be there along with the counter. A mutex doesn't need a lot of space.

We want a busy-wait loop to avoid wasting power and slowing down other threads, even other logical threads sharing the same core with hyperthreading. nop doesn't make a busy-wait loop less CPU-intensive. Even with hyperthreading, it probably makes no difference on x86, because the entire loop body probably still fits in 4 uops, and so issues at one iteration per clock whether there's a nop in there or not. The nop doesn't need an execution unit, so at least it doesn't hurt. Also, this spin-loop happens with the mutex held, which seems silly: the first waiter gets to this spin loop, while any waiters after it will spin on the mutex itself.

Here's my naive implementation of a semaphore, using only C11 atomic ops:
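A minimal sketch along those lines, matching the explanation that follows (a read-only acquire-load spin loop, then a fetch_add of -1, undone if the count was already gone); the names naive_sem_t / sem_down / sem_up and the SPIN_BODY macro are illustrative, not a standard API:

```c
#include <stdatomic.h>

#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#define SPIN_BODY() _mm_pause()   /* see the PAUSE discussion below */
#else
#define SPIN_BODY() ((void)0)
#endif

typedef struct {
    // Signed, so a temporary extra decrement can't wrap the count
    // around to a huge "available" value.
    atomic_int val;
} naive_sem_t;

static void sem_down(naive_sem_t *sem)
{
    while (1) {
        // Read-only spin until the semaphore looks available.
        while (atomic_load_explicit(&sem->val, memory_order_acquire) < 1)
            SPIN_BODY();
        // fetch_add returns the OLD value, so we can tell whether another
        // thread took the count between our load and this decrement.
        int tmp = atomic_fetch_add_explicit(&sem->val, -1,
                                            memory_order_acquire);
        if (tmp >= 1)
            break;          // we successfully took one count
        // Lost the race: undo our decrement and retry.
        atomic_fetch_add_explicit(&sem->val, 1, memory_order_release);
    }
}

static void sem_up(naive_sem_t *sem)
{
    atomic_fetch_add_explicit(&sem->val, 1, memory_order_release);
}
```

With a naive_sem_t whose val is set up via atomic_init, sem_down spins until a count is available and sem_up releases one.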
I think this is a good implementation that achieves its very-limited goals of being correct and small (source code and machine code), and not using other actual locking primitives. There are major areas that I don't even attempt to address (e.g. fairness/starvation, yielding the CPU to other threads, probably other stuff).
See the asm output on godbolt: only 12 x86 insns for down, 2 for up (including the rets). Godbolt's non-x86 compilers (gcc 4.8 for ARM/ARM64/PPC) are too old to support C11 <stdatomic.h>. (They do have C++ std::atomic, though.) So unfortunately I can't easily check the asm output on non-x86.

The trick here is that it's ok for val to temporarily be too low; that just makes other threads spin. Also note that fetch_add being a single atomic operation is key. It returns the old value, so we can detect when val was taken by another thread between the while loop's load and the fetch_add. (Note that we don't need to check that tmp is equal to the while loop's load: it's fine if another thread up-ed the semaphore between the load and the fetch_add. This is a benefit of using fetch_add instead of cmpxchg.)

The atomic_load spin loop is just a performance optimization over having all the waiters doing atomic read-modify-writes on val. (Although with many waiters trying to dec and then undo with inc, having a waiter ever see the lock unlocked could be very rare.)

A real implementation would have special stuff for more platforms than just x86, and for x86 probably more than just a PAUSE instruction inside the spin loop. This is still just a toy example of a fully-portable C11 implementation. PAUSE apparently helps avoid mis-speculation on memory ordering so the CPU runs more efficiently after leaving the spin loop. pause is not the same as yielding the logical CPU to the OS for a different thread to run. It also has nothing to do with correctness or the choice of memory_order_??? parameters.

A real implementation would probably give up the CPU to the OS after some number of iterations of spinning (sched_yield(2), or more likely a futex system call, see below). Maybe use x86 MONITOR/MWAIT to be even more hyperthreading-friendly; I'm not sure. I haven't ever implemented locking myself for real, I just see all this stuff in the x86 insn reference while looking up other insns.

As mentioned previously, x86's lock xadd instruction implements fetch_add (with sequential-consistency semantics, since locked instructions are always a full memory barrier). On non-x86, using only acquire+release semantics for the fetch_add, rather than full sequential consistency, might allow more efficient code. I'm not sure, but using just acquire would quite likely allow more efficient code on ARM64. I think we only need acquire on the fetch_add, not acq_rel, but I'm not sure. On x86 there won't be any difference in code, since locked instructions are the only way to do atomic read-modify-writes, so even relaxed will be the same as seq_cst (except for compile-time reordering).

If you want to yield the CPU instead of spinning, you need a system call (as people have said). Obviously a lot of work has gone into making the standard library locking as efficient as possible on Linux. There are dedicated system calls for helping the kernel wake the right thread(s) when a lock is released, and they are not simple to use; see futex(7).

fairness / starvation (which my naive implementation ignores)
As the wikipedia article mentions, some kind of wakeup queue is a good idea, so the same thread doesn't keep getting the semaphore every time. (Code that re-takes a lock quickly after releasing it would usually have the releasing thread get the lock again while other threads are still asleep.)
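For illustration, one simple way to get FIFO ordering among waiters is a ticket scheme (sketched here as a spinning ticket mutex rather than a full semaphore; the names ticket_lock_t / ticket_lock / ticket_unlock are mine, not a standard API):

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   // drawn by arriving threads
    atomic_uint now_serving;   // advanced on unlock
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    // Each arriving thread draws a unique ticket, so threads are served
    // in FIFO order: a thread that just released can't barge ahead of
    // waiters that were already queued.
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        ;  // spin (a real version would pause / yield / futex-wait here)
}

static void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```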
This is another major benefit of kernel cooperation in the locking process (futex).
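As a rough illustration of futex as a building block (a hedged, Linux-only sketch; the names fsem_down / fsem_up and the overall structure are mine, not the glibc implementation, and error handling is omitted):

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void fsem_down(atomic_int *val)
{
    while (1) {
        int v = atomic_load_explicit(val, memory_order_acquire);
        if (v > 0 &&
            atomic_compare_exchange_weak_explicit(val, &v, v - 1,
                    memory_order_acquire, memory_order_relaxed))
            return;          // took one count
        if (v <= 0)
            // Sleep in the kernel, but only if *val still equals v,
            // so a concurrent fsem_up can't be missed.
            futex(val, FUTEX_WAIT_PRIVATE, v);
    }
}

static void fsem_up(atomic_int *val)
{
    atomic_fetch_add_explicit(val, 1, memory_order_release);
    futex(val, FUTEX_WAKE_PRIVATE, 1);   // wake one waiter, if any
}
```

The point of the FUTEX_WAIT value check is exactly the "building block" role the man page describes: the fast path stays in user space, and the kernel is only involved when a thread actually has to sleep or be woken.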