I was thinking about how to implement semaphores (not binary) using less asm code as possible.
I haven't succeeded in thinking and writing it without using a mutex, so here's the best I could do till now:
Global:
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
typedef struct
{
atomic_ullong value;
pthread_mutex_t *lock_op;
bool ready;
} semaphore_t;
typedef struct
{
atomic_ullong value;
pthread_mutex_t lock_op;
bool ready;
} static_semaphore_t;
/* use with static_semaphore_t */
#define SEMAPHORE_INITIALIZER(value) = {value, PTHREAD_MUTEX_INITIALIZER, true}
Functions:
bool semaphore_init(semaphore_t *semaphore, unsigned long long init_value)
{
if(semaphore->ready) if(!(semaphore->lock_op = \
calloc(1, sizeof(pthread_mutex_t)))) return false;
else pthread_mutex_destroy(semaphore->lock_op);
if(pthread_mutex_init(semaphore->lock_op, NULL))
return false;
semaphore->value = init_value;
semaphore->ready = true;
return true;
}
bool semaphore_wait(semaphore_t *semaphore)
{
if(!semaphore->ready) return false;
pthread_mutex_lock(&(semaphore->lock_op));
while(!semaphore->value) __asm__ __volatile__("nop");
(semaphore->value)--;
pthread_mutex_unlock(&(semaphore->lock_op));
return true;
}
bool semaphore_post(semaphore_t *semaphore)
{
if(!semaphore->ready) return false;
atomic_fetch_add(&(semaphore->value), (unsigned long long) 1);
return true;
}
Is it possible to implement a semaphore using only few lines, with the atomic built-ins or directly in assembly (ex. lock cmpxchg
)?
Looking at the sem_t struct from <bits/sempahore.h>
included by <semaphore.h>
it seems to me that it has been chosen a very different path...
typedef union
{
char __size[__SIZEOF_SEM_T];
long int __align;
} sem_t;
UPDATE:
@PeterCordes has proposed a definitely much better solution, using the atomics, without a mutex, doing the checks directly on the semaphore value.
I still want to understand better the chances to improve the code in terms of performance taking advantages of the built-in pauses functions or kernel calls that avoid CPU waste, waiting the critical resources to be available.
It also would be nice to have a standard implementation of mutexes and non binary semaphores for comparison.
From futex(7) I read: "The Linux kernel provides futexes ("Fast user-space mutexes") as a building block for fast user-space locking and semaphores. Futexes are very basic and lend themselves well for building higher-level locking abstractions such as mutexes, condition variables, read-write locks, barriers, and semaphores."
See part way down for my minimal naive semaphore implementation which probably works. It compiles and looks right for x86. I think it's correct in general for any C11 implementation.
IIRC, it's possible to implement a counting lock (aka semaphore) with just a single integer, which you access with atomic operations. That wikipedia link even gives algorithms for up
/down
. You don't need a separate mutex. If atomic_ullong
needs a mutex to support atomic increment/decrement on the target CPU, it will include one. (This may be the case on 32bit x86, or the implementation uses a slow cmpxchg8
instead of a fast lock xadd
. Is a 32bit counter really too small for your semaphore? Because 64bit atomics will be slower on 32bit machines.)
The <bits/sempahore.h>
union definition is clearly just an opaque POD type with the correct size, not indicative of the actual implementation.
As @David Schwartz says, it's a fool's errand to implement your own locking for practical use, unless you're an expert. It might be an interesting way to learn about atomic operations and find out what's under the hood in the standard implementations, though. Note carefully his caution that locking implementations are hard to test. You might write code that works for your test-case on your hardware with code from the current version of your compiler with your chosen compile options...
The ready
boolean is just a total waste of space. If you can correctly initialize the ready
flag so that it's meaningful for functions to look at it, then you can initialize the other fields to a sane initial state.
Your code has a few other problems I noticed:
#define SEMAPHORE_INITIALIZER(value) = {value, PTHREAD_MUTEX_INITIALIZER, true};
static_semaphore_t my_lock = SEMAPHORE_INITIALIZER(1);
// expands to my_lock = = {1, PTHREAD_MUTEX_INITIALIZER, true};;
// you should leave out the = and ; in the macro def so it works like a value
Using a dynamically-allocated pthread_mutex_t *lock_op
is just silly. Use value, not a pointer. Most of your locking functions use the mutex, so the extra level of indirection just slows things down. It would be much better for the memory to just be there along with the counter. A mutex doesn't need a lot of space.
while(!semaphore->value) __asm__ __volatile__("nop");
We want this loop to avoid wasting power and slowing down other threads and, even other logical threads sharing the same core with hyperthreading.
nop
doesn't make a busy-wait loop less CPU-intensive. Even with hyperthreading, it probably makes no difference on x86, because the entire loop body still probably fits in 4 uops, and so issues at one iteration per clock whether there's a nop
in there or not. The nop
doesn't need an execution unit, so at least it doesn't hurt. This spin-loop happens with a mutex held, which seems silly. So the first waiter will get to this spin loop, while waiters after that will spin on the mutex.
Here's my naive implementation of a semaphore, using only C11 atomic ops
I think this is a good implementation that achieves its very-limited goals of being correct and small (source code and machine code), and not using other actual locking primitives. There are major areas that I don't even attempt to address (e.g. fairness/starvation, yielding the CPU to other threads, probably other stuff).
See the asm output on godbolt: only 12 x86 insns for down
, 2 for up
(including the ret
s). Godbolt's non-x86 compilers (gcc 4.8 for ARM/ARM64/PPC) are too old to support C11 <stdatomic.h>
. (They do have C++ std::atomic
, though). So I unfortunately can't easily check the asm output on non-x86.
#include <stdatomic.h>
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
typedef struct {
atomic_int val; // int is plenty big. Must be signed, so an extra decrement doesn't make 0 wrap to >= 1
} naive_sem_t;
#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
static inline void spinloop_body(void) { _mm_pause(); } // PAUSE is "rep nop" in asm output
#else
static inline void spinloop_body(void) { }
#endif
void sem_down(naive_sem_t *sem)
{
while (1) {
while (likely(atomic_load_explicit(&(sem->val), memory_order_acquire ) < 1))
spinloop_body(); // wait for a the semaphore to be available
int tmp = atomic_fetch_add_explicit( &(sem->val), -1, memory_order_acq_rel ); // try to take the lock. Might only need mo_acquire
if (likely(tmp >= 1))
break; // we successfully got the lock
else // undo our attempt; another thread's decrement happened first
atomic_fetch_add_explicit( &(sem->val), 1, memory_order_release ); // could be "relaxed", but we're still busy-waiting and want other thread to see this ASAP
}
}
// note the release, not seq_cst. Use a stronger ordering parameter if you want it to be a full barrier.
void sem_up(naive_sem_t *sem) {
atomic_fetch_add_explicit(&(sem->val), 1, memory_order_release);
}
The trick here is that it's ok for val
to temporarily be too low; that just makes other threads spin. Also note that fetch_add
being a single atomic operation is key. It returns the old value, so we can detect when val
was taken by another thread between the while loop's load and the fetch_add. (Note that we don't need to check that tmp
is == to the while loop's load: it's fine if another thread up
ed the semaphore between the load and the fetch_add. This is a benefit to using fetch_add instead of cmpxchg).
The atomic_load
spin loop is just a performance optimization over having all the waiters doing atomic read-modify-writes on val
. (Although with many waiters trying to dec and then undo with inc, having a waiter ever see the lock unlocked could be very rare).
A real implementation would have special stuff for more platforms than just x86. For x86, probably more than just a PAUSE
instruction inside the spinloop. This is still just a toy example of a fully-portable C11 implementation. PAUSE
apparently helps avoid mis-speculation on memory ordering so the CPU runs more efficiently after leaving the spin loop. pause
is not the same as yielding the logical CPU to the OS for a different thread to run. It also has nothing to do with correctness and choice of memory_order_???
parameters.
A real implementation would probably give up the CPU to the OS after some number of iterations of spinning (sched_yield(2)
, or more likely a futex
system call, see below). Maybe use x86 MONITOR
/ MWAIT
to be even more hyperthreading-friendly; I'm not sure. I haven't ever implemented locking myself for real, I just see all this stuff in the x86 insn reference while looking up other insns.
As mentioned previously, x86's lock xadd
instruction implements fetch_add
(with sequential-consistency semantics, since lock
ed instructions are always a full memory barrier). On non-x86, using only acquire+release semantics for the fetch_add, not full sequential consistency might possibly allow more efficient code. I'm not sure, but using just acquire
would quite likely allow more efficient code on ARM64. I think we do only need acquire
on the fetch_add, not acq_rel, but I'm not sure. On x86 there won't be any difference in code, since lock
ed instructions are the only way to do atomic read-modify-write, so even relaxed
will be the same as seq_cst
(except for compile-time reordering.)
If you want to yield the CPU instead of spinning, you need a system call (as people have said). Obviously a lot of work has gone into making the standard library locking as efficient as possible on Linux. There are dedicated system calls for helping the kernel wake the right thread(s) when a lock is released, and they are not simple to use. From futex(7)
:
NOTES
To reiterate, bare futexes are not intended as an easy-to-use abstraction for end-users. (There is no wrapper function for this
system call
in glibc.) Implementors are expected to be assembly literate and to have read the sources of the futex user-space library
referenced below.
fairness / starvation (which my naive implementation ignores)
As the wikipedia article mentions, some kind of wakeup queue is a good idea, so the same thread doesn't keep getting the semaphore every time. (Code that takes a lock quickly after releasing it would usually have the releasing thread get the lock while other threads are still asleep).
This is another major benefit to kernel cooperation in the process (futex
).