I have this simple example I was testing, and I noticed that gcc's optimizations (-O3) do not seem to be as good as clang's when operator new is involved. I was wondering what the issue might be, and whether it is possible to force gcc to produce more optimized code somehow?
template <typename T>
T* create() { return new T(); }

int main() {
    auto result = 0;
    for (auto i = 0; i < 1000000; ++i) {
        result += (create<int>() != nullptr);
    }
    return result;
}
#clang3.6++ -O3 -s --std=c++11 test.cpp
#size a.out
text data bss dec hex filename
1324 616 8 1948 79c a.out
#time ./a.out
real 0m0.002s
user 0m0.001s
sys 0m0.000s
#gcc4.9 -O3 -s --std=c++11 test.cpp
#size a.out
text data bss dec hex filename
1484 624 8 2116 844 a.out
#time ./a.out
real 0m0.045s
user 0m0.035s
sys 0m0.009s
The example above is just a simplified version of the code I was testing originally,
but it still illustrates the difference between gcc and clang.
I checked the assembly code as well; there is not a huge difference in size, but there definitely is in performance. On the other hand, maybe clang is doing something that is not allowed?
If we plug this code into godbolt, we can see that clang
optimizes the loop away to this:
main: # @main
movl $1000000, %eax # imm = 0xF4240
ret
while gcc
does not perform this optimization. So the question is: is this a valid optimization? Does it follow the as-if rule
, which is set out in the draft C++ standard, section 1.9
Program execution, which says (emphasis mine):
The semantic descriptions in this International Standard define a
parameterized nondeterministic abstract machine. This International
Standard places no requirement on the structure of conforming
implementations. In particular, they need not copy or emulate the
structure of the abstract machine. Rather, conforming implementations
are required to emulate (only) the observable behavior of the abstract
machine as explained below.5
where note 5
says:
This provision is sometimes called the “as-if” rule, because an
implementation is free to disregard any requirement of this
International Standard as long as the result is as if the requirement
had been obeyed, as far as can be determined from the observable
behavior of the program. For instance, an actual implementation need
not evaluate part of an expression if it can deduce that its value is
not used and that no side effects affecting the observable behavior of
the program are produced.
The complication is that new
could throw an exception, which would be observable behavior, since it would alter the return value of the program.
R. Martinho Fernandes argues that when to throw an exception is an implementation detail, so clang
could decide that this scenario would not cause an exception, and therefore eliding the new
call would not violate the as-if rule
. This seems like a reasonable argument to me.
But as T.C. points out:
A replacement global operator new could have been defined in a different translation unit
Casey provided an example showing that even when clang
can see there is such a replacement, it still performs this optimization, even though side effects are lost. So this seems like an overly aggressive optimization.
Note, memory leaks are not undefined behavior.
The rationale is that there is no rule on how much memory a machine may have, nor does the language provide any way to examine the amount of memory allocated or free (though note that POSIX does define mallinfo). Here, we simulate your program on an abstract machine with infinite memory, where allocation continuously succeeds. Or at least, infinite memory for the purposes of the allocations in this loop, but not consistently for the whole program. Anyhow, I'm aware of two good objections to this.
First, consider if it were malloc instead of operator new. The C99 spec states:
The malloc function allocates space for an object whose size is specified by size and whose value is indeterminate. The malloc function returns either a null pointer or a pointer to the allocated space.
Compiling malloc() to always succeed seems to comply with that spec. But what if you call it more times than we could actually create a pointer for and only exit the loop once it fails? One possible way out is to note that there is no rule in the abstract machine definition that a 64-bit pointer can only hold 2^64 possible values, only that there is no way provided to construct values outside this range. It appears that the implementation may create such things at will. Personally I find that answer unsatisfying.
Consider that we also optimize things like "T *t1 = new T; T *t2 = (T*)rand();"
by assuming that t1
may not alias t2
. It doesn't matter if rand picked the right address or if you iterated across the whole address space, once we show that t1's address did not feed into t2, we should be able to conclude that they refer to different objects. While I would like that to be the way the standard worked, and that is how compilers work, I'm not aware of any standardese to support that position. This will likely become the subject of a future paper.
Second, operator new isn't malloc; it is a replaceable function. As was suggested in Casey's reply, we intend to follow the rules in N3664 (though I don't think clang treats new-expressions differently from explicit calls to operator new yet). Shafik pointed out that this seems to violate causality, but N3664 started life as N3433, and I'm pretty sure we wrote the optimization first and wrote the paper afterwards anyway.
It would appear that clang is optimizing memory allocations according to the changed rules in N3664: Clarifying Memory Allocation, which was incorporated into C++14. N3664 allows reducing the number of calls to allocation and deallocation functions by merging allocations or eliminating them altogether.
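The distinction N3664 draws can be illustrated as follows (function names are mine): the permission to elide applies to allocations made by new-expressions, while an explicit call to the allocation function is an ordinary function call not covered by that permission. As noted above, clang at the time did not yet treat the two differently.

```cpp
#include <new>

// Under C++14 (per N3664), the implementation is allowed to omit the
// allocation behind this new-expression, so the loop may fold to a
// constant. (Both functions deliberately leak, mirroring the original.)
int countViaNewExpression() {
    int result = 0;
    for (int i = 0; i < 1000000; ++i)
        result += (new int() != nullptr);                    // elidable
    return result;
}

// An explicit call to the allocation function is an ordinary function
// call, so this one is not covered by the N3664 latitude.
int countViaOperatorNew() {
    int result = 0;
    for (int i = 0; i < 1000000; ++i)
        result += (::operator new(sizeof(int)) != nullptr);  // ordinary call
    return result;
}
```

Either way, both functions return 1000000 on the abstract machine, since a failed allocation would throw rather than yield a null pointer.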