I have this simple example I was testing, and I noticed that gcc's optimizations (-O3) do not seem to be as good as clang's when operator new is involved. I was wondering what the issue might be, and whether it is possible to force gcc to produce more optimized code somehow?
template <typename T>
T* create() { return new T(); }

int main() {
    auto result = 0;
    for (auto i = 0; i < 1000000; ++i) {
        result += (create<int>() != nullptr);
    }
    return result;
}
#clang3.6++ -O3 -s --std=c++11 test.cpp
#size a.out
text data bss dec hex filename
1324 616 8 1948 79c a.out
#time ./a.out
real 0m0.002s
user 0m0.001s
sys 0m0.000s
#gcc4.9 -O3 -s --std=c++11 test.cpp
#size a.out
text data bss dec hex filename
1484 624 8 2116 844 a.out
#time ./a.out
real 0m0.045s
user 0m0.035s
sys 0m0.009s
The example above is just a simplified version of the code I was testing originally,
but it still illustrates the difference between gcc and clang.
I checked the assembly code as well; there is not a huge difference in size, but there definitely is in performance. On the other hand, maybe clang is doing something that is not allowed?
If we plug this code into godbolt, we can see that clang
optimizes the loop away to this:
main: # @main
movl $1000000, %eax # imm = 0xF4240
ret
while gcc
does not perform this optimization. So the question is: is this a valid optimization? Does it follow the as-if rule
, which is set out in the draft C++ standard, section 1.9
Program execution, which says (emphasis mine):
The semantic descriptions in this International Standard define a
parameterized nondeterministic abstract machine. This International
Standard places no requirement on the structure of conforming
implementations. In particular, they need not copy or emulate the
structure of the abstract machine. Rather, conforming implementations
are required to emulate (only) the observable behavior of the abstract
machine as explained below.5
where note 5
says:
This provision is sometimes called the “as-if” rule, because an
implementation is free to disregard any requirement of this
International Standard as long as the result is as if the requirement
had been obeyed, as far as can be determined from the observable
behavior of the program. For instance, an actual implementation need
not evaluate part of an expression if it can deduce that its value is
not used and that no side effects affecting the observable behavior of
the program are produced.
The complication is that new
could throw an exception, which would be observable behavior, since it would alter the return value of the program.
R. Martinho Fernandes argues that when to throw an exception is an implementation detail, so clang
could decide that this scenario would not cause an exception, and therefore eliding the new
call would not violate the as-if rule
. This seems like a reasonable argument to me.
But as T.C. points out:
A replacement global operator new could have been defined in a different translation unit
Casey provided an example showing that even when clang
can see there is such a replacement, it still performs this optimization, even though side effects are lost. So this seems like an overly aggressive optimization.
Note, memory leaks are not undefined behavior.
The rationale is that there is no rule on how much memory a machine may have, nor does the language provide any way to examine the amount of memory allocated or free (though note that POSIX does define mallinfo). Here, we simulate your program on an abstract machine with infinite memory, where allocation continuously succeeds. Or at least, infinite memory for the purposes of the allocations in this loop, but not consistently for the whole program. Anyhow, I'm aware of two good objections to this.
First, consider if it were malloc instead of operator new. The C99 spec states:
The malloc function allocates space for an object whose size is specified by size and whose value is indeterminate. The malloc function returns either a null pointer or a pointer to the allocated space.
Compiling malloc() to always succeed seems to comply with that spec. But what if you call it more times than we could actually create a pointer for and only exit the loop once it fails? One possible way out is to note that there is no rule in the abstract machine definition that a 64-bit pointer can only hold 2^64 possible values, only that there is no way provided to construct values outside this range. It appears that the implementation may create such things at will. Personally I find that answer unsatisfying.
Consider that we also optimize things like "T *t1 = new T; T *t2 = (T*)rand();"
by assuming that t1
may not alias t2
. It doesn't matter if rand picked the right address or if you iterated across the whole address space, once we show that t1's address did not feed into t2, we should be able to conclude that they refer to different objects. While I would like that to be the way the standard worked, and that is how compilers work, I'm not aware of any standardese to support that position. This will likely become the subject of a future paper.
Second, operator new isn't malloc; it is a replaceable function. As was suggested in Casey's reply, we intend to follow the rules in N3664 (though I don't think clang treats new-expressions differently from explicit calls to operator new yet). Shafik pointed out that this seems to violate causality, but N3664 started life as N3433, and I'm pretty sure we wrote the optimization first and wrote the paper afterwards anyway.
It would appear that clang is optimizing memory allocations according to the changed rules in N3664: Clarifying Memory Allocation, which was incorporated into C++14. N3664 allows reducing the number of calls to allocation and deallocation functions by merging allocations or eliminating them altogether.
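The distinction N3664 draws can be illustrated as follows (function names are mine): the permission to elide applies to allocations made by new-expressions, while an explicit call to the allocation function is an ordinary function call not covered by that permission. As noted above, clang at the time did not yet treat the two differently.

```cpp
#include <new>

// Under C++14 (per N3664), the implementation is allowed to omit the
// allocation behind this new-expression, so the loop may fold to a
// constant. (Both functions deliberately leak, mirroring the original.)
int countViaNewExpression() {
    int result = 0;
    for (int i = 0; i < 1000000; ++i)
        result += (new int() != nullptr);                    // elidable
    return result;
}

// An explicit call to the allocation function is an ordinary function
// call, so this one is not covered by the N3664 latitude.
int countViaOperatorNew() {
    int result = 0;
    for (int i = 0; i < 1000000; ++i)
        result += (::operator new(sizeof(int)) != nullptr);  // ordinary call
    return result;
}
```

Either way, both functions return 1000000 on the abstract machine, since a failed allocation would throw rather than yield a null pointer.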