This doesn't seem to be quite right, although I am unsure why. Advice would be great, as the documentation I can find for CMPXCHG16B is pretty minimal (I don't own any Intel manuals...)
template<>
inline bool cas(volatile types::uint128_t *src, types::uint128_t cmp, types::uint128_t with)
{
/*
Description:
The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX registers
with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set,
and the RCX:RBX value is copied to the memory location.
Otherwise, the ZF flag is cleared, and the memory value is copied to RDX:RAX.
*/
uint64_t * cmpP = (uint64_t*)&cmp;
uint64_t * withP = (uint64_t*)&with;
unsigned char result = 0;
__asm__ __volatile__ (
"LOCK; CMPXCHG16B %1\n\t"
"SETZ %b0\n\t"
: "=q"(result) /* output */
: "m"(*src), /* input */
//what to compare against
"rax"( ((uint64_t) (cmpP[1])) ), //lower bits
"rdx"( ((uint64_t) (cmpP[0])) ),//upper bits
//what to replace it with if it was equal
"rbx"( ((uint64_t) (withP[1])) ), //lower bits
"rcx"( ((uint64_t) (withP[0]) ) )//upper bits
: "memory", "cc", "rax", "rdx", "rbx","rcx" /* clobbered items */
);
return result;
}
When running with an example I am getting 0 when it should be 1. Any ideas?
I got it compiling with g++ after a slight change (removing "oword ptr" from the cmpxchg16b instruction).
But it doesn't seem to overwrite the memory as required, though I may be wrong. [See update] Code is given below, followed by the output.
I'm not sure the output makes sense to me. I was expecting the "before" value to be something like 00000000decafbad00000feedbeef, according to the struct definition, but the bytes seem to be spread out within words. Is that due to the aligned directive? By the way, the CAS operation does seem to return the correct value. Any help in deciphering this?
Update: I just did some debugging and inspected the memory with gdb, and the correct values are shown there. So I guess this must be a problem with my print_dlong procedure. Feel free to correct it. I am leaving this reply up to be corrected, since a corrected version would be an instructive example of the CAS operation with printed results.
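Since the original print_dlong is not shown, the following is only a guess at what a corrected version might look like. It assumes a hypothetical struct dlong with the low quadword stored first (the natural little-endian layout for CMPXCHG16B), and it formats the high half before the low half, each zero-padded to 16 hex digits. Printing the bytes in memory order, or dropping the zero padding, are two common ways to get output that looks scrambled.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical 128-bit type: low quadword first, as it sits in memory on x86-64.
struct alignas(16) dlong {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Format the value the way a human reads it: most significant half first,
// each half zero-padded to exactly 16 hex digits.
std::string format_dlong(const dlong &v)
{
    char buf[33];  // 32 hex digits plus the terminating NUL
    std::snprintf(buf, sizeof buf, "%016" PRIx64 "%016" PRIx64, v.hi, v.lo);
    return std::string(buf);
}
```

A print_dlong built on this would simply pass the returned string to printf or std::cout before and after the CAS.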
I noticed a few issues:
(1) The main problem is the constraints. "rax" doesn't do what it looks like; gcc reads the constraint string one letter at a time, so the leading "r" lets it use any general register. The single-letter constraints "a", "d", "b" and "c" are the ones that pin an operand to rax, rdx, rbx and rcx.
(2) I'm not sure how you're storing types::uint128_t, but assuming the standard little-endian layout for x86 platforms, the high and low quadwords are also swapped around: cmpP[0] is the low half and cmpP[1] is the high half.
(3) Taking the address of something and casting it to a different pointer type can break the aliasing rules. Whether this is an issue depends on how your types::uint128_t is defined (it is fine if it is a struct of two uint64_t's). GCC with -O2 will optimize on the assumption that the aliasing rules are not violated.
(4) *src should really be marked as an output, rather than relying on the "memory" clobber, but this is more of a performance issue than a correctness issue. Similarly, rbx and rcx do not need to be specified as clobbered.
Here is a version that works:
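The following is a minimal sketch that applies the four points above; it is not necessarily the code that was originally posted here. It assumes types::uint128_t is a 16-byte-aligned struct of two uint64_t's with the low half first, and the name cas128 is made up for this example. The "memory" clobber is kept only as a conservative compiler barrier.

```cpp
#include <cassert>
#include <cstdint>

namespace types {
// Assumed layout: low quadword first (little-endian), 16-byte aligned.
struct alignas(16) uint128_t {
    std::uint64_t lo;
    std::uint64_t hi;
};
}

inline bool cas128(volatile types::uint128_t *dst,
                   types::uint128_t cmp, types::uint128_t with)
{
    // CMPXCHG16B faults on a memory operand that is not 16-byte aligned.
    assert((reinterpret_cast<std::uintptr_t>(dst) & 15) == 0);

    unsigned char result;
    __asm__ __volatile__(
        "lock cmpxchg16b %1\n\t"
        "setz %0"
        : "=q"(result),    // ZF: 1 if the swap happened
          "+m"(*dst),      // memory operand is read and written
          "+a"(cmp.lo),    // rax: low half of the comparand (overwritten on failure)
          "+d"(cmp.hi)     // rdx: high half of the comparand (overwritten on failure)
        : "b"(with.lo),    // rbx: low half of the replacement
          "c"(with.hi)     // rcx: high half of the replacement
        : "cc", "memory"); // flags change; "memory" acts as a compiler barrier
    return result;
}
```

Because the comparand halves are accessed as struct members rather than through a casted pointer, the aliasing concern in (3) does not arise.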
It's good to note that if you're using GCC, you don't need to use inline asm to get at this instruction. You can use one of the __sync functions, like:
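As a rough sketch of that approach (the wrapper name cas128_sync is made up for this example), __sync_bool_compare_and_swap can be applied directly to GCC's unsigned __int128 type. GCC only emits CMPXCHG16B inline when the instruction is enabled, normally with the -mcx16 command-line flag; the target("cx16") function attribute used here is assumed to have the same effect for this one function.

```cpp
#include <cassert>
#include <cstdint>

// Enable CMPXCHG16B for this function only (assumed equivalent to -mcx16).
__attribute__((target("cx16")))
inline bool cas128_sync(volatile unsigned __int128 *dst,
                        unsigned __int128 cmp, unsigned __int128 with)
{
    // The instruction requires a 16-byte-aligned operand.
    assert((reinterpret_cast<std::uintptr_t>(dst) & 15) == 0);

    // Returns true if *dst equalled cmp and has been replaced by with.
    return __sync_bool_compare_and_swap(dst, cmp, with);
}
```

If the instruction is not enabled, GCC typically falls back to an external call that fails at link time, so an undefined reference to __sync_bool_compare_and_swap_16 is a hint that the flag or attribute is missing.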
Microsoft has a similar intrinsic for VC++: _InterlockedCompareExchange128, declared in <intrin.h>.
All Intel documentation is available for free: Intel® 64 and IA-32 Architectures Software Developer's Manuals.