I'm working on a function that stores a 64-bit value into memory in big endian format. I was hoping that I could write portable C99 code that works on both little and big endian platforms and have modern x86 compilers generate a bswap instruction automatically without any builtins or intrinsics. So I started with the following function:
    #include <stdint.h>

    void
    encode_bigend_u64(uint64_t value, void *vdest) {
        uint64_t bigend;
        uint8_t *bytes = (uint8_t*)&bigend;

        bytes[0] = value >> 56;
        bytes[1] = value >> 48;
        bytes[2] = value >> 40;
        bytes[3] = value >> 32;
        bytes[4] = value >> 24;
        bytes[5] = value >> 16;
        bytes[6] = value >> 8;
        bytes[7] = value;

        uint64_t *dest = (uint64_t*)vdest;
        *dest = bigend;
    }
This works fine for clang which compiles this function to:
    bswapq  %rdi
    movq    %rdi, (%rsi)
    retq
But GCC fails to detect the byte swap. I tried a couple of different approaches but they only made things worse. I know that GCC can detect byte swaps using bitwise-and, shift, and bitwise-or, but why doesn't it work when writing bytes?
Edit: I found the corresponding GCC bug.
I like Peter's solution, but here's something else you can use on Haswell. Haswell has the `movbe` instruction, which is 3 uops there (no cheaper than `bswap r64` + a normal load or store), but is faster on Atom / Silvermont (https://agner.org/optimize/):

Use it with something like

    uint64_t tmp = load_bigend_u64(array[i]);

You could reverse this to make a `store_bigend` function, or use `bswap` to modify a value in a register and let the compiler load/store it.

I changed the function to return `value` because the alignment of `vdest` was not clear to me.

Usually a feature is guarded by a preprocessor macro. I'd expect `__MOVBE__` to be used for the `movbe` feature flag, but it's not present (this machine has the feature):

All functions in this answer with asm output on the Godbolt Compiler Explorer
GNU C has a `uint64_t __builtin_bswap64 (uint64_t x)`, since GNU C 4.3. This is apparently the most reliable way to get gcc / clang to generate code that doesn't suck for this.

glibc provides `htobe64`, `htole64`, and similar host to/from BE and LE functions that swap or not, depending on the endianness of the machine. See the docs for `<endian.h>`. The man page says they were added to glibc in version 2.9 (released 2008-11).

You safely get good code even at `-O1` from those functions, and they use `movbe` when `-march` is set to a CPU that supports that insn.

If you're targeting GNU C, but not glibc, you can borrow the definition from glibc (remember it's LGPLed code, though):
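Independently of glibc's actual source, a from-scratch sketch of a host-to-big-endian store for GNU C compilers might look like the following. It assumes the `__BYTE_ORDER__`, `__ORDER_LITTLE_ENDIAN__`, and `__ORDER_BIG_ENDIAN__` macros that gcc and clang predefine. The names `my_htobe64` and `store_bigend_u64` are made up for this illustration, and `memcpy` is used so that the destination does not need to be aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a host-order value to big-endian order.
 * Assumes a GNU C compatible compiler (gcc or clang). */
static inline uint64_t my_htobe64(uint64_t x) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return __builtin_bswap64(x);   /* little-endian host: swap the bytes */
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return x;                      /* big-endian host: already in order */
#else
#error "unknown byte order: this sketch only handles little and big endian"
#endif
}

/* Hypothetical helper: store value at dest in big-endian byte order. */
static inline void store_bigend_u64(uint64_t value, void *dest) {
    uint64_t be = my_htobe64(value);
    memcpy(dest, &be, sizeof be);  /* no alignment requirement on dest */
}

/* Usage check: the most significant byte must land first in memory. */
void store_bigend_u64_example(void) {
    unsigned char buf[8];
    store_bigend_u64(UINT64_C(0x0102030405060708), buf);
    assert(buf[0] == 0x01 && buf[7] == 0x08);
}
```

With optimization enabled, gcc and clang typically fold the `memcpy` into a single 64-bit store, so the helper should not cost more than the pointer-cast version.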
If you really need a fallback that might compile well on compilers that don't support GNU C builtins, the code from @bolov's answer could be used to implement a bswap that compiles nicely. Pre-processor macros could be used to choose whether to swap or not (like glibc does), to implement host-to-BE and host-to-LE functions. The bswap used by glibc when `__builtin_bswap` or x86 asm isn't available uses the mask-and-shift idiom that bolov found was good. gcc recognizes it better than just shifting.

The code from this Endian-agnostic coding blog post compiles to bswap with gcc, but not with clang. IDK if there's anything that both their pattern-recognizers will recognize.
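As an illustration of the mask-and-shift fallback described above, here is a minimal sketch in plain C99. It is written from scratch rather than taken from glibc or from bolov's answer. `portable_bswap64`, `host_to_be64`, `host_to_le64`, and the `HOST_IS_BIG_ENDIAN` macro are hypothetical names; the macro would have to be supplied by the build system, since C99 has no standard endianness macro. Whether a given compiler turns this into a single `bswap` is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical configuration macro: define to 1 on big-endian targets. */
#ifndef HOST_IS_BIG_ENDIAN
#define HOST_IS_BIG_ENDIAN 0   /* assumption: little-endian unless told otherwise */
#endif

/* Reverse the byte order: mask out each byte, shift it to its
 * mirrored position, and OR the eight pieces together. */
static inline uint64_t portable_bswap64(uint64_t x) {
    return ((x & UINT64_C(0xff00000000000000)) >> 56) |
           ((x & UINT64_C(0x00ff000000000000)) >> 40) |
           ((x & UINT64_C(0x0000ff0000000000)) >> 24) |
           ((x & UINT64_C(0x000000ff00000000)) >>  8) |
           ((x & UINT64_C(0x00000000ff000000)) <<  8) |
           ((x & UINT64_C(0x0000000000ff0000)) << 24) |
           ((x & UINT64_C(0x000000000000ff00)) << 40) |
           ((x & UINT64_C(0x00000000000000ff)) << 56);
}

/* Choose at preprocessing time whether each conversion swaps or not. */
#if HOST_IS_BIG_ENDIAN
#define host_to_be64(x) (x)
#define host_to_le64(x) portable_bswap64(x)
#else
#define host_to_be64(x) portable_bswap64(x)
#define host_to_le64(x) (x)
#endif

/* Usage check: swapping reverses the eight bytes of the value. */
void portable_bswap64_example(void) {
    assert(portable_bswap64(UINT64_C(0x0102030405060708)) ==
           UINT64_C(0x0807060504030201));
}
```

Exactly one of the two conversion macros is the identity on any given target, which is the same swap-or-not selection the paragraph above attributes to glibc.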
The `htonll` from this answer compiles to two 32bit `bswap`s combined with shift/or. This kind of sucks, but isn't terrible with either gcc or clang.

I didn't have any luck with a `union { uint64_t a; uint8_t b[8]; }` version of the OP's code. clang still compiles it to a 64bit `bswap`, but I think it compiles to even worse code with gcc. (See the godbolt link).

This seems to do the trick:
clang with `-O3`
clang with `-O3 -march=native`
gcc with `-O3`
gcc with `-O3 -march=native`
Tested with clang 3.8.0 and gcc 5.3.0 on http://gcc.godbolt.org/ (so I don't know exactly what processor is underneath for `-march=native`, but I strongly suspect a recent x86_64 processor).

If you want a function which works for big endian architectures too, you can use the answers from here to detect the endianness of the system and add an `if`. Both the union and the pointer casts versions work and are optimized by both `gcc` and `clang`, resulting in the exact same assembly (no branches). Full code on godbolt:

Intel® 64 and IA-32 Architectures Instruction Set Reference (3-542 Vol. 2A):