I recently came across a strange deoptimization (or rather missed optimization opportunity).
Consider this function for efficient unpacking of arrays of 3-bit integers to 8-bit integers. It unpacks 16 ints in each loop iteration:
void unpack3bit(uint8_t* target, char* source, int size) {
while(size > 0){
uint64_t t = *reinterpret_cast<uint64_t*>(source);
target[0] = t & 0x7;
target[1] = (t >> 3) & 0x7;
target[2] = (t >> 6) & 0x7;
target[3] = (t >> 9) & 0x7;
target[4] = (t >> 12) & 0x7;
target[5] = (t >> 15) & 0x7;
target[6] = (t >> 18) & 0x7;
target[7] = (t >> 21) & 0x7;
target[8] = (t >> 24) & 0x7;
target[9] = (t >> 27) & 0x7;
target[10] = (t >> 30) & 0x7;
target[11] = (t >> 33) & 0x7;
target[12] = (t >> 36) & 0x7;
target[13] = (t >> 39) & 0x7;
target[14] = (t >> 42) & 0x7;
target[15] = (t >> 45) & 0x7;
source+=6;
size-=6;
target+=16;
}
}
Here is the generated assembly for parts of the code:
...
367: 48 89 c1 mov rcx,rax
36a: 48 c1 e9 09 shr rcx,0x9
36e: 83 e1 07 and ecx,0x7
371: 48 89 4f 18 mov QWORD PTR [rdi+0x18],rcx
375: 48 89 c1 mov rcx,rax
378: 48 c1 e9 0c shr rcx,0xc
37c: 83 e1 07 and ecx,0x7
37f: 48 89 4f 20 mov QWORD PTR [rdi+0x20],rcx
383: 48 89 c1 mov rcx,rax
386: 48 c1 e9 0f shr rcx,0xf
38a: 83 e1 07 and ecx,0x7
38d: 48 89 4f 28 mov QWORD PTR [rdi+0x28],rcx
391: 48 89 c1 mov rcx,rax
394: 48 c1 e9 12 shr rcx,0x12
398: 83 e1 07 and ecx,0x7
39b: 48 89 4f 30 mov QWORD PTR [rdi+0x30],rcx
...
It looks quite efficent. Simply a shift right
followed by an and
, and then a store
to the target
buffer. But now, look what happens when I change the function to a method in a struct:
struct T{
uint8_t* target;
char* source;
void unpack3bit( int size);
};
void T::unpack3bit(int size) {
while(size > 0){
uint64_t t = *reinterpret_cast<uint64_t*>(source);
target[0] = t & 0x7;
target[1] = (t >> 3) & 0x7;
target[2] = (t >> 6) & 0x7;
target[3] = (t >> 9) & 0x7;
target[4] = (t >> 12) & 0x7;
target[5] = (t >> 15) & 0x7;
target[6] = (t >> 18) & 0x7;
target[7] = (t >> 21) & 0x7;
target[8] = (t >> 24) & 0x7;
target[9] = (t >> 27) & 0x7;
target[10] = (t >> 30) & 0x7;
target[11] = (t >> 33) & 0x7;
target[12] = (t >> 36) & 0x7;
target[13] = (t >> 39) & 0x7;
target[14] = (t >> 42) & 0x7;
target[15] = (t >> 45) & 0x7;
source+=6;
size-=6;
target+=16;
}
}
I thought the generated assembly should be quite the same, but it isn't. Here is a part of it:
...
2b3: 48 c1 e9 15 shr rcx,0x15
2b7: 83 e1 07 and ecx,0x7
2ba: 88 4a 07 mov BYTE PTR [rdx+0x7],cl
2bd: 48 89 c1 mov rcx,rax
2c0: 48 8b 17 mov rdx,QWORD PTR [rdi] // Load, BAD!
2c3: 48 c1 e9 18 shr rcx,0x18
2c7: 83 e1 07 and ecx,0x7
2ca: 88 4a 08 mov BYTE PTR [rdx+0x8],cl
2cd: 48 89 c1 mov rcx,rax
2d0: 48 8b 17 mov rdx,QWORD PTR [rdi] // Load, BAD!
2d3: 48 c1 e9 1b shr rcx,0x1b
2d7: 83 e1 07 and ecx,0x7
2da: 88 4a 09 mov BYTE PTR [rdx+0x9],cl
2dd: 48 89 c1 mov rcx,rax
2e0: 48 8b 17 mov rdx,QWORD PTR [rdi] // Load, BAD!
2e3: 48 c1 e9 1e shr rcx,0x1e
2e7: 83 e1 07 and ecx,0x7
2ea: 88 4a 0a mov BYTE PTR [rdx+0xa],cl
2ed: 48 89 c1 mov rcx,rax
2f0: 48 8b 17 mov rdx,QWORD PTR [rdi] // Load, BAD!
...
As you see, we introduced an additional redundant load
from memory before each shift (mov rdx,QWORD PTR [rdi]
). It seems like the target
pointer (which is now a member instead of a local variable) has to be always reloaded before storing into it. This slows down the code considerably (around 15% in my measurements).
First I thought maybe the C++ memory model enforces that a member pointer may not be stored in a register but has to be reloaded, but this seemed like an awkward choice, as it would make a lot of viable optimizations impossible. So I was very surprised that the compiler did not store target
in a register here.
I tried caching the member pointer myself into a local variable:
void T::unpack3bit(int size) {
while(size > 0){
uint64_t t = *reinterpret_cast<uint64_t*>(source);
uint8_t* target = this->target; // << ptr cached in local variable
target[0] = t & 0x7;
target[1] = (t >> 3) & 0x7;
target[2] = (t >> 6) & 0x7;
target[3] = (t >> 9) & 0x7;
target[4] = (t >> 12) & 0x7;
target[5] = (t >> 15) & 0x7;
target[6] = (t >> 18) & 0x7;
target[7] = (t >> 21) & 0x7;
target[8] = (t >> 24) & 0x7;
target[9] = (t >> 27) & 0x7;
target[10] = (t >> 30) & 0x7;
target[11] = (t >> 33) & 0x7;
target[12] = (t >> 36) & 0x7;
target[13] = (t >> 39) & 0x7;
target[14] = (t >> 42) & 0x7;
target[15] = (t >> 45) & 0x7;
source+=6;
size-=6;
this->target+=16;
}
}
This code also yields the "good" assembler without additional stores. So my guess is: The compiler is not allowed to hoist the load of a member pointer of a struct, so such a "hot pointer" should always be stored in a local variable.
- So, why is the compiler unable to optimize out these loads?
- Is it the C++ memory model that forbids this? Or is it simply a shortcoming of my compiler?
- Is my guess correct or what is the exact reason why the optimization can't be performed?
The compiler in use was g++ 4.8.2-19ubuntu1
with -O3
optimization. I also tried clang++ 3.4-1ubuntu3
with similar results: Clang is even able to vectorize the method with the local target
pointer. However, using the this->target
pointer yields the same result: An extra load of the pointer before each store.
I checked the assembler of some similar methods and the result is the same: It seems that a member of this
always has to be reloaded before a store, even if such a load could simply be hoisted outside the loop. I will have to rewrite a lot of code to get rid of these additional stores, mainly by caching the pointer myself into a local variable that is declared above the hot code. But I always thought fiddling with such details as caching a pointer in a local variable would surely qualify for premature optimization in these days where compilers have gotten so clever. But it seems I am wrong here. Caching a member pointer in a hot loop seems to be a necessary manual optimization technique.
Pointer aliasing seems to be the problem, ironically between
this
andthis->target
. The compiler is taking into account the rather obscene possibility that you initialized:this->target = &this
In that case, writing to
this->target[0]
would alter the contents ofthis
(and thus, this->target).The memory aliasing problem is not restricted to the above. In principle, any use of
this->target[XX]
given an (in)appropriate value ofXX
might point tothis
.I am better versed in C, where this can be remedied by declaring pointer variables with the __restrict__ keyword.
The issue here is strict aliasing which says that we are allowed to alias through a char* and so that prevents compiler optimization in your case. We are not allowed to alias through a pointer of a different type which would be undefined behavior, normally on SO we see this problem which is users attempting to alias through incompatible pointer types.
It would seem reasonable to implement uint8_t as a unsigned char and if we look at cstdint on Coliru it includes stdint.h which typedefs uint8_t as follows:
if you used another non-char type then the compiler should be able to optimize.
This is covered in the draft C++ standard section
3.10
Lvalues and rvalues which says:and includes the following bullet:
Note, I posted a comment on possible work arounds in a question that asks When is uint8_t ≠ unsigned char? and the recommendation was:
Since C++ does not support the restrict keyword you have to rely on compiler extension, for example gcc uses __restrict__ so this is not totally portable but the other suggestion should be.
Strict aliasing rules allows
char*
to alias any other pointer. Sothis->target
may alias withthis
, and in your code method, the first part of the code,is in fact
as
this
may be modified when you modifythis->target
content.Once
this->target
is cached into a local variable, the alias is no longer possible with the local variable.