I am new to low-level programming, so I am largely unaware of the kinds of problems you can run into down there, and I am not even sure I understand the term "atomic" correctly. Right now I am trying to build simple atomic locks around memory manipulation via extended inline assembly. Why? For the sake of curiosity. I know I am reinventing the wheel here and possibly oversimplifying the whole process.
The question: does the code I present here achieve the goal of making memory manipulation both thread-safe and reentrant?
- If it works, why?
- If it doesn't work, why?
- Not good enough? Should I, for example, make use of the register keyword in C?
What I simply want to do...
- Before memory manipulation, lock.
- After memory manipulation, unlock.
The code:
volatile int atomic_gate_memory = 0;
static inline void atomic_open(volatile int *gate)
{
    asm volatile (
        "wait:\n"
        "cmp %[lock], %[gate]\n"
        "je wait\n"
        "mov %[lock], %[gate]\n"
        : [gate] "=m" (*gate)
        : [lock] "r" (1)
    );
}

static inline void atomic_close(volatile int *gate)
{
    asm volatile (
        "mov %[lock], %[gate]\n"
        : [gate] "=m" (*gate)
        : [lock] "r" (0)
    );
}
Then something like:
void *_malloc(size_t size)
{
    atomic_open(&atomic_gate_memory);
    void *mem = malloc(size);
    atomic_close(&atomic_gate_memory);
    return mem;
}
#define malloc(size) _malloc(size)
... and the same for calloc, realloc, free, and fork (on Linux).
#ifdef _UNISTD_H
int _fork()
{
    pid_t pid;
    atomic_open(&atomic_gate_memory);
    pid = fork();
    atomic_close(&atomic_gate_memory);
    return pid;
}
#define fork() _fork()
#endif
After the stack-frame setup for atomic_open, objdump shows:
00000000004009a7 <wait>:
4009a7: 39 10 cmp %edx,(%rax)
4009a9: 74 fc je 4009a7 <wait>
4009ab: 89 10 mov %edx,(%rax)
Also, given the disassembly above, can I assume I am making an atomic operation because it is only one instruction?
register is a meaningless hint in modern optimizing compilers.

I think a simple spinlock that doesn't have any of the really major / obvious performance problems on x86 is something like the sketch below. Of course a real implementation would use a system call (like Linux futex) after spinning for a while, and unlocking would have to check whether it needs to notify any waiters with another system call. This is important; you don't want to spin forever wasting CPU time (and energy / heat) doing nothing. But conceptually this is the spin part of a spinlock before you take the fallback path. It's an important piece of how lightweight locking is implemented. (Only attempting to take the lock once before calling the kernel would be a valid choice, instead of spinning at all.)

Implement as much of this as you like in inline asm, or preferably using C11 stdatomic, like this semaphore implementation.
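A minimal sketch of that idea, assuming C11 stdatomic plus the x86 _mm_pause intrinsic; the names spin_lock / spin_unlock and spinlock_t are illustrative, not taken from any particular library:

#include <stdatomic.h>
#include <immintrin.h>   // _mm_pause on x86

typedef atomic_int spinlock_t;   // 0 = unlocked, 1 = locked

static inline void spin_lock(spinlock_t *lock)
{
    for (;;) {
        // Try the atomic RMW first, so the un-contended case takes the
        // lock immediately and never executes pause.
        if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
            return;
        // Contended: spin with plain loads until the lock looks free,
        // so we don't hammer the owner's cache line with locked RMWs.
        while (atomic_load_explicit(lock, memory_order_relaxed))
            _mm_pause();
    }
}

static inline void spin_unlock(spinlock_t *lock)
{
    // A plain release store is enough to unlock; no locked instruction.
    atomic_store_explicit(lock, 0, memory_order_release);
}

The gate from the question would then be a spinlock_t, with _malloc calling spin_lock(&atomic_gate_memory) before malloc and spin_unlock(&atomic_gate_memory) after.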
If you were using a bitfield of atomic flags, you could use lock bts (bit test and set) for the equivalent of xchg-with-1. You can spin on bt or test. To unlock, you'd need lock btr, not just btr, because that would be a non-atomic read-modify-write of the byte, or even of the containing 32 bits.
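If you want to experiment with that, here is a rough sketch using C11 stdatomic (bit_try_lock / bit_unlock are made-up names; whether a compiler actually emits lock bts / lock btr here, rather than a lock cmpxchg loop or lock or / lock and, depends on the target and on how the result is used):

#include <stdatomic.h>
#include <stdbool.h>

// One lock per bit inside a word of flags.
static inline bool bit_try_lock(atomic_ulong *flags, unsigned bit)
{
    unsigned long mask = 1UL << bit;
    // Atomic RMW set, then test the old value of just that bit.
    return !(atomic_fetch_or_explicit(flags, mask, memory_order_acquire) & mask);
}

static inline void bit_unlock(atomic_ulong *flags, unsigned bit)
{
    // Clearing the bit must also be an atomic RMW, because other bits in
    // the same word can be modified concurrently by other threads.
    atomic_fetch_and_explicit(flags, ~(1UL << bit), memory_order_release);
}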
With a byte- or word-sized lock, you don't even need a locked operation to unlock; release semantics are enough. glibc's pthread_spin_unlock does the same as the unlock above: a simple store.

This avoids writing to the lock if we see it's already locked, which avoids invalidating the cache line in L1 of the core running the thread that owns it, so it can go back to "Modified" (MESIF or MOESI) with less cache-coherence delay during unlock.

We also don't flood the CPU with locked operations in a loop. I'm not sure how much this slows things down in general, but 10 threads all waiting for the same spinlock will keep the memory arbitration hardware pretty busy. This might slow down the thread that does hold the lock, or other unrelated threads on the system, while they use other locks, or memory in general.

pause is also essential, to avoid mis-speculation about memory ordering by the CPU: you exit the loop only when the memory you're reading was modified by another core. However, we don't want to pause in the un-contended case. On Skylake, pause waits a lot longer, like ~100 cycles IIRC, so you should definitely keep the spin loop separate from the initial check for unlocked.

I'm sure Intel's and AMD's optimization manuals talk about this; see the x86 tag wiki for that and tons of other links.