I was thinking about how to implement counting (not binary) semaphores using as little asm code as possible.
I haven't managed to write one without using a mutex, so here's the best I could do so far:
Global:
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
typedef struct
{
    atomic_ullong value;
    pthread_mutex_t *lock_op;
    bool ready;
} semaphore_t;

typedef struct
{
    atomic_ullong value;
    pthread_mutex_t lock_op;
    bool ready;
} static_semaphore_t;
/* use with static_semaphore_t */
#define SEMAPHORE_INITIALIZER(value) {value, PTHREAD_MUTEX_INITIALIZER, true}
Functions:
bool semaphore_init(semaphore_t *semaphore, unsigned long long init_value)
{
    if(semaphore->ready)
        pthread_mutex_destroy(semaphore->lock_op);
    else if(!(semaphore->lock_op = calloc(1, sizeof(pthread_mutex_t))))
        return false;

    if(pthread_mutex_init(semaphore->lock_op, NULL))
        return false;

    semaphore->value = init_value;
    semaphore->ready = true;

    return true;
}
bool semaphore_wait(semaphore_t *semaphore)
{
    if(!semaphore->ready) return false;

    pthread_mutex_lock(semaphore->lock_op);
    while(!semaphore->value) __asm__ __volatile__("nop");
    (semaphore->value)--;
    pthread_mutex_unlock(semaphore->lock_op);

    return true;
}
bool semaphore_post(semaphore_t *semaphore)
{
    if(!semaphore->ready) return false;

    atomic_fetch_add(&(semaphore->value), (unsigned long long) 1);
    return true;
}
Is it possible to implement a semaphore in only a few lines, with the atomic built-ins or directly in assembly (e.g. lock cmpxchg)?
Looking at the sem_t struct from <bits/semaphore.h> (included by <semaphore.h>), it seems a very different path was chosen...
typedef union
{
    char __size[__SIZEOF_SEM_T];
    long int __align;
} sem_t;
UPDATE:
@PeterCordes has proposed a much better solution using atomics without a mutex, doing the checks directly on the semaphore value.
I still want to understand how to improve the code's performance by taking advantage of pause instructions or kernel calls that avoid wasting CPU while waiting for the critical resource to become available.
It would also be nice to have a standard implementation of mutexes and counting semaphores for comparison.
From futex(7) I read: "The Linux kernel provides futexes ("Fast user-space mutexes") as a building block for fast user-space locking and semaphores. Futexes are very basic and lend themselves well for building higher-level locking abstractions such as mutexes, condition variables, read-write locks, barriers, and semaphores."
See partway down for my minimal naive semaphore implementation, which probably works. It compiles and looks right for x86. I think it's correct in general for any C11 implementation.
IIRC, it's possible to implement a counting lock (aka semaphore) with just a single integer, which you access with atomic operations. That wikipedia link even gives algorithms for up/down. You don't need a separate mutex. If atomic_ullong needs a mutex to support atomic increment/decrement on the target CPU, it will include one. (This may be the case on 32-bit x86, or the implementation may use a slow cmpxchg8b instead of a fast lock xadd. Is a 32-bit counter really too small for your semaphore? Because 64-bit atomics will be slower on 32-bit machines.)

The <bits/semaphore.h> union definition is clearly just an opaque POD type with the correct size, not indicative of the actual implementation.

As @David Schwartz says, it's a fool's errand to implement your own locking for practical use unless you're an expert. It might be an interesting way to learn about atomic operations and find out what's under the hood in the standard implementations, though. Note carefully his caution that locking implementations are hard to test. You might write code that works for your test-case on your hardware, with code from the current version of your compiler with your chosen compile options...
The ready boolean is just a total waste of space. If you can correctly initialize the ready flag so that it's meaningful for functions to look at it, then you can initialize the other fields to a sane initial state, too.

Your code has a few other problems I noticed:

Using a dynamically-allocated pthread_mutex_t *lock_op is just silly. Store the mutex by value, not behind a pointer. Most of your locking functions use the mutex, so the extra level of indirection just slows things down. It would be much better for the memory to just be there along with the counter. A mutex doesn't need a lot of space.

We want a busy-wait loop to avoid wasting power and slowing down other threads, even other logical threads sharing the same core with hyperthreading. nop doesn't make a busy-wait loop less CPU-intensive. Even with hyperthreading, it probably makes no difference on x86, because the entire loop body probably still fits in 4 uops, and so issues at one iteration per clock whether there's a nop in there or not. The nop doesn't need an execution unit, so at least it doesn't hurt. Also, this spin-loop happens with the mutex held, which seems silly: the first waiter gets to this spin loop, while any waiters after it will spin on the mutex itself.

Here's my naive implementation of a semaphore, using only C11 atomic ops:
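A minimal sketch along those lines, matching the explanation that follows (a read-only acquire-load spin loop, then a fetch_add of -1, undone if the count was already gone); the names naive_sem_t / sem_down / sem_up and the SPIN_BODY macro are illustrative, not a standard API:

```c
#include <stdatomic.h>

#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#define SPIN_BODY() _mm_pause()   /* see the PAUSE discussion below */
#else
#define SPIN_BODY() ((void)0)
#endif

typedef struct {
    // Signed, so a temporary extra decrement can't wrap the count
    // around to a huge "available" value.
    atomic_int val;
} naive_sem_t;

static void sem_down(naive_sem_t *sem)
{
    while (1) {
        // Read-only spin until the semaphore looks available.
        while (atomic_load_explicit(&sem->val, memory_order_acquire) < 1)
            SPIN_BODY();
        // fetch_add returns the OLD value, so we can tell whether another
        // thread took the count between our load and this decrement.
        int tmp = atomic_fetch_add_explicit(&sem->val, -1,
                                            memory_order_acquire);
        if (tmp >= 1)
            break;          // we successfully took one count
        // Lost the race: undo our decrement and retry.
        atomic_fetch_add_explicit(&sem->val, 1, memory_order_release);
    }
}

static void sem_up(naive_sem_t *sem)
{
    atomic_fetch_add_explicit(&sem->val, 1, memory_order_release);
}
```

With a naive_sem_t whose val is set up via atomic_init, sem_down spins until a count is available and sem_up releases one.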
I think this is a good implementation that achieves its very-limited goals of being correct and small (source code and machine code), and not using other actual locking primitives. There are major areas that I don't even attempt to address (e.g. fairness/starvation, yielding the CPU to other threads, probably other stuff).
See the asm output on godbolt: only 12 x86 insns for down, 2 for up (including the rets). Godbolt's non-x86 compilers (gcc 4.8 for ARM/ARM64/PPC) are too old to support C11 <stdatomic.h>. (They do have C++ std::atomic, though.) So unfortunately I can't easily check the asm output on non-x86.

The trick here is that it's ok for val to temporarily be too low; that just makes other threads spin. Also note that fetch_add being a single atomic operation is key. It returns the old value, so we can detect when val was taken by another thread between the while loop's load and the fetch_add. (Note that we don't need to check that tmp is equal to the while loop's load: it's fine if another thread up-ed the semaphore between the load and the fetch_add. This is a benefit of using fetch_add instead of cmpxchg.)

The atomic_load spin loop is just a performance optimization over having all the waiters doing atomic read-modify-writes on val. (Although with many waiters trying to dec and then undo with inc, having a waiter ever see the lock unlocked could be very rare.)

A real implementation would have special stuff for more platforms than just x86, and for x86 probably more than just a PAUSE instruction inside the spin loop. This is still just a toy example of a fully-portable C11 implementation. PAUSE apparently helps avoid mis-speculation on memory ordering so the CPU runs more efficiently after leaving the spin loop. pause is not the same as yielding the logical CPU to the OS for a different thread to run. It also has nothing to do with correctness or the choice of memory_order_??? parameters.

A real implementation would probably give up the CPU to the OS after some number of iterations of spinning (sched_yield(2), or more likely a futex system call, see below). Maybe use x86 MONITOR/MWAIT to be even more hyperthreading-friendly; I'm not sure. I haven't ever implemented locking myself for real, I just see all this stuff in the x86 insn reference while looking up other insns.

As mentioned previously, x86's lock xadd instruction implements fetch_add (with sequential-consistency semantics, since locked instructions are always a full memory barrier). On non-x86, using only acquire+release semantics for the fetch_add, rather than full sequential consistency, might allow more efficient code. I'm not sure, but using just acquire would quite likely allow more efficient code on ARM64. I think we only need acquire on the fetch_add, not acq_rel, but I'm not sure. On x86 there won't be any difference in code, since locked instructions are the only way to do atomic read-modify-writes, so even relaxed will be the same as seq_cst (except for compile-time reordering).

If you want to yield the CPU instead of spinning, you need a system call (as people have said). Obviously a lot of work has gone into making the standard library locking as efficient as possible on Linux. There are dedicated system calls for helping the kernel wake the right thread(s) when a lock is released, and they are not simple to use; see futex(7).

fairness / starvation (which my naive implementation ignores)
As the wikipedia article mentions, some kind of wakeup queue is a good idea, so the same thread doesn't keep getting the semaphore every time. (Code that re-takes a lock quickly after releasing it would usually have the releasing thread get the lock again while other threads are still asleep.)
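For illustration, one simple way to get FIFO ordering among waiters is a ticket scheme (sketched here as a spinning ticket mutex rather than a full semaphore; the names ticket_lock_t / ticket_lock / ticket_unlock are mine, not a standard API):

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   // drawn by arriving threads
    atomic_uint now_serving;   // advanced on unlock
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    // Each arriving thread draws a unique ticket, so threads are served
    // in FIFO order: a thread that just released can't barge ahead of
    // waiters that were already queued.
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        ;  // spin (a real version would pause / yield / futex-wait here)
}

static void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```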
This is another major benefit of kernel cooperation in the locking process (futex).
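As a rough illustration of futex as a building block (a hedged, Linux-only sketch; the names fsem_down / fsem_up and the overall structure are mine, not the glibc implementation, and error handling is omitted):

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void fsem_down(atomic_int *val)
{
    while (1) {
        int v = atomic_load_explicit(val, memory_order_acquire);
        if (v > 0 &&
            atomic_compare_exchange_weak_explicit(val, &v, v - 1,
                    memory_order_acquire, memory_order_relaxed))
            return;          // took one count
        if (v <= 0)
            // Sleep in the kernel, but only if *val still equals v,
            // so a concurrent fsem_up can't be missed.
            futex(val, FUTEX_WAIT_PRIVATE, v);
    }
}

static void fsem_up(atomic_int *val)
{
    atomic_fetch_add_explicit(val, 1, memory_order_release);
    futex(val, FUTEX_WAKE_PRIVATE, 1);   // wake one waiter, if any
}
```

The point of the FUTEX_WAIT value check is exactly the "building block" role the man page describes: the fast path stays in user space, and the kernel is only involved when a thread actually has to sleep or be woken.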