Here's a simple function that tries to read a generic two's-complement integer from a big-endian buffer, where we'll assume std::is_signed_v<INT_T>:
template<typename INT_T>
INT_T read_big_endian(uint8_t const *data) {
    INT_T result = 0;
    for (size_t i = 0; i < sizeof(INT_T); i++) {
        result <<= 8;
        result |= *data;
        data++;
    }
    return result;
}
Unfortunately, this is undefined behaviour, as the last <<= shifts into the sign bit.
So now we try the following:
template<typename INT_T>
INT_T read_big_endian(uint8_t const *data) {
    std::make_unsigned_t<INT_T> result = 0;
    for (size_t i = 0; i < sizeof(INT_T); i++) {
        result <<= 8;
        result |= *data;
        data++;
    }
    return static_cast<INT_T>(result);
}
But we're now invoking implementation-defined behaviour in the static_cast, converting from unsigned to signed.
How can I do this while staying in the "well-defined" realm?
To propose an alternative solution, the best way to copy bits and avoid UB is through memcpy.
With memcpy you won't get UB from casting an unsigned to a signed type, and with optimizations, this compiles to the exact same assembly as your examples.
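A minimal sketch of what the memcpy approach might look like (not necessarily the listing this answer originally showed). It assembles the bytes in the corresponding unsigned type and then copies the object representation into the signed type. It assumes the signed type is two's complement with no padding bits, which is only guaranteed from C++20 onwards:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

template <typename INT_T>
INT_T read_big_endian(std::uint8_t const *data) {
    static_assert(std::is_integral_v<INT_T> && std::is_signed_v<INT_T>,
                  "INT_T must be a signed integer type");
    assert(data != nullptr);

    using UINT_T = std::make_unsigned_t<INT_T>;

    // Assemble the value in the unsigned type, where shifting is well defined.
    UINT_T bits = 0;
    for (std::size_t i = 0; i < sizeof(INT_T); i++) {
        bits = static_cast<UINT_T>((bits << 8) | data[i]);
    }

    // Copy the object representation instead of converting the value.
    // Assumption: INT_T is two's complement, so the same bit pattern
    // denotes the intended negative number.
    INT_T result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

For example, on a two's-complement platform `read_big_endian<std::int16_t>` applied to the bytes `{0x80, 0x00}` yields -32768.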
Compiled with clang++ /tmp/test.cpp -std=c++17 -c -O3 on x86_64-linux-gnu with clang++ v8.
Most of the time, memcpy with optimizations will compile to the exact same assembly as what you intend, but with the added benefit of no UB.
Updating for correctness: The OP correctly notes that this would still be invalid since signed int representations do not need to be two's complement (at least until C++20) and this would be implementation-defined behavior.
AFAICT, up until C++20, there doesn't actually seem to be a neat C++ way of performing bit-level operations on ints without actually knowing the bit representation of a signed int, which is implementation-defined. That being said, as long as you know your compiler will represent a C++ integral type as two's complement, then either memcpy or the static_cast in the OP's second example should work.
A major reason C++20 exclusively represents signed ints as two's complement is that most existing compilers already represent them that way. Both GCC and LLVM (and thus Clang) already internally use two's complement.
This doesn't seem entirely portable (and it's understandable if this isn't the best answer), but I would imagine that you know what compiler you'll be building your code with, so you can technically wrap this or your second example with checks to see you're using an appropriate compiler.
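One way such a check might be written is as a set of compile-time assertions, so the build fails on an implementation that does not behave as assumed. This is only a sketch; the constant name `kTwosComplement` is made up for illustration:

```cpp
#include <limits>

// -1 & 3 is 3 in two's complement, 2 in ones' complement,
// and 1 in sign-magnitude.
constexpr bool kTwosComplement = (-1 & 3) == 3;

static_assert(kTwosComplement,
              "signed integers must be two's complement");

// In two's complement the range is asymmetric: min() == -max() - 1.
static_assert(std::numeric_limits<int>::min() +
                  std::numeric_limits<int>::max() == -1,
              "unexpected signed integer range");

// The unsigned-to-signed conversion used by the static_cast version is
// implementation-defined before C++20; require the modular result here.
static_assert(static_cast<int>(~0u) == -1,
              "unsigned-to-signed conversion must wrap modulo 2^N");
```

With these in a shared header, either the memcpy version or the static_cast version is only compiled on implementations that behave as expected.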
Start by assembling bytes into an unsigned value. Unless you need to assemble groups of 9 or more octets, a conforming C99 implementation is guaranteed to have such a type that is large enough to hold them all (a C89 implementation would be guaranteed to have an unsigned type large enough to hold at least four).
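A sketch of how this answer's approach could look for a four-octet value, written here in C++ and big-endian to match the question (the answer's own listing may have differed; the function name `read_be32_signed` is hypothetical). Values with the top bit set are converted by subtracting 2^32 in pieces, so that no intermediate result leaves the range of long:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decode four big-endian octets as a signed
// 32-bit two's-complement value, returned as long.
long read_be32_signed(const std::uint8_t *data) {
    assert(data != nullptr);

    // Step 1: assemble the octets in an unsigned type.
    unsigned long u = 0;
    for (int i = 0; i < 4; i++) {
        u = (u << 8) | data[i];
    }

    // Step 2: non-negative values fit in long directly.
    if (u <= 0x7FFFFFFFul) {
        return static_cast<long>(u);
    }

    // Step 3: the encoded value is u - 2^32. Subtract in parts so every
    // intermediate stays within the range of long.
    long r = static_cast<long>(u - 0x80000000ul);  // 0 .. 2147483647
    r -= 2147483647L;                              // -2147483647 .. 0
    r -= 1;                                        // -2147483648 .. -1
    return r;
}
```

In this big-endian sketch the only input that can exceed the guaranteed range of long is `{0x80, 0, 0, 0}`, which encodes -2147483648.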
In most cases, where you want to convert a sequence of octets to a number, you'll know how many octets you're expecting. If data is encoded as 4 bytes, you should use four bytes regardless of the sizes of int and long (a portable function should return type long).
Note that the subtraction is done as two parts, each within the range of a signed long, to allow for the possibility of systems where LONG_MIN is -2147483647. Attempting to convert byte sequence {0,0,0,0x80} on such a system may yield Undefined Behavior [since it would compute the value -2147483648] but the code should process in fully portable fashion all values which would be within the range of "long".
Actually, in C++17, left-shifting a signed integer that has a negative value is undefined behavior. Left-shifting a signed integer that has a positive value into the sign bit is implementation defined behavior. See also:
(C++17 final working draft, Section 8.8 Shift operators [expr.shift], Paragraph 2, page 132 - emphasis mine)
With C++20, shifting into the sign bit changed from implementation defined to defined behavior:
(C++20 latest working draft, Section 7.6.7 Shift operators [expr.shift], Paragraph 2, page 129)
Example:
Assertion: -2 is the unique value that is congruent to 2147483647 * 2 % 2**32
Check:
The value -2 is unique because there is no other value in the domain [INT_MIN .. INT_MAX] that satisfies this congruence relation.
This is a consequence of C++20 mandating two's complement representation of signed integer types:
(C++20 latest working draft, Section 6.8.1 Fundamental types [basic.fundamental], Paragraph 3, page 66)
This means that with C++20, your original example invokes defined behavior, as-is.
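The congruence claimed above can be checked without performing the signed shift itself, by doing the arithmetic in a wider type. The following is a small sketch that assumes a 32-bit int; the helper name `mod_2_32` is made up for illustration:

```cpp
#include <cstdint>

// Hypothetical helper: the non-negative remainder of v modulo 2^32.
constexpr std::int64_t mod_2_32(std::int64_t v) {
    constexpr std::int64_t m = std::int64_t{1} << 32;
    return ((v % m) + m) % m;
}

// 2147483647 * 2 and -2 fall into the same residue class modulo 2^32.
static_assert(mod_2_32(std::int64_t{2147483647} * 2) == 4294967294LL,
              "2147483647 * 2 is 4294967294 modulo 2^32");
static_assert(mod_2_32(-2) == 4294967294LL,
              "-2 is 4294967294 modulo 2^32");

// Uniqueness: two distinct values in [INT_MIN .. INT_MAX] differ by at
// most 2^32 - 1, so they cannot be congruent modulo 2^32.
```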
Additional note: not that this proves anything, but the GCC/Clang undefined behavior sanitizer (invoked with -fsanitize=undefined) only triggers when compiling this example for std <= C++17 and then only complains about the shifting of the negative value (both as expected):
Example session (on Fedora 31):