Question:
Consider the following statement:
*((char*)NULL) = 0; //undefined behavior
It clearly invokes undefined behavior. Does the existence of such a statement in a given program mean that the whole program is undefined or that behavior only becomes undefined once control flow hits this statement?
Would the following program be well-defined in case the user never enters the number 3?
while (true) {
    int num = ReadNumberFromConsole();
    if (num == 3)
        *((char*)NULL) = 0; // undefined behavior
}
Or is it entirely undefined behavior no matter what the user enters?
Also, can the compiler assume that undefined behavior will never be executed at runtime? That would allow for reasoning backwards in time:
int num = ReadNumberFromConsole();
if (num == 3) {
    PrintToConsole(num);
    *((char*)NULL) = 0; // undefined behavior
}
Here, the compiler could reason that in case num == 3 we will always invoke undefined behavior. Therefore, this case must be impossible and the number does not need to be printed. The entire if statement could be optimized out. Is this kind of backwards reasoning allowed according to the standard?
Answer 1:
Does the existence of such a statement in a given program mean that
the whole program is undefined or that behavior only becomes undefined
once control flow hits this statement?
Neither. The first condition is too strong and the second is too weak.
Object accesses are sometimes sequenced, but the standard describes the behavior of the program outside of time. Danvil already quoted:
if any such execution contains an undefined operation, this
International Standard places no requirement on the implementation
executing that program with that input (not even with regard to
operations preceding the first undefined operation)
This can be interpreted as:
If the execution of the program yields undefined behavior, then the whole program has
undefined behavior.
So an unreachable statement with UB doesn't give the program UB. A statement that is reachable but (because of the values of inputs) never actually reached doesn't give the program UB either. That's why your first condition is too strong.
Now, the compiler cannot in general tell what has UB. So to allow the optimizer to re-order statements with potential UB that would be re-orderable should their behavior be defined, it's necessary to permit UB to "reach back in time" and go wrong prior to the preceding sequence point (or in C++11 terminology, for the UB to affect things that are sequenced before the UB thing). Therefore your second condition is too weak.
A major example of this is when the optimizer relies on strict aliasing. The whole point of the strict aliasing rules is to allow the compiler to re-order operations that could not validly be re-ordered if it were possible that the pointers in question alias the same memory. So if you use illegally aliasing pointers, and UB does occur, then it can easily affect a statement "before" the UB statement. As far as the abstract machine is concerned the UB statement has not been executed yet. As far as the actual object code is concerned, it has been partly or fully executed. But the standard doesn't try to get into detail about what it means for the optimizer to re-order statements, or what the implications of that are for UB. It just gives the implementation license to go wrong as soon as it pleases.
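As a hedged illustration (the function and variable names here are invented for this sketch, not taken from the standard or the answer above), consider:

int observe_and_clobber(int *i, float *f) {
    *i = 1;     // under strict aliasing, the compiler may assume that
    *f = 2.0f;  // *f cannot refer to the same memory as *i, so it may
    return *i;  // reorder the stores or fold this load to the constant 1
}

If a caller breaks the aliasing rules, say observe_and_clobber(&n, reinterpret_cast<float*>(&n)) on an int n, the UB from the float-typed store can then visibly affect the int store that appears "before" it, which is exactly the reach-back described above.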
You can think of this as, "UB has a time machine".
Specifically to answer your examples:
- Behavior is only undefined if 3 is read.
- Compilers can and do eliminate code as dead if a basic block contains an operation certain to be undefined. They're permitted to (and I'm guessing do) in cases which aren't a basic block but where all branches lead to UB. This example isn't a candidate unless PrintToConsole(3) is somehow known to be sure to return; it could throw an exception or whatever (see the sketch after this list).
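As a hedged sketch of that distinction (PrintToConsoleStub is a hypothetical stand-in whose definition the compiler can see and which provably returns, unlike the opaque PrintToConsole):

#include <cstddef> // for NULL

static void PrintToConsoleStub(int) { /* provably returns */ }

void demo(int num) {
    if (num == 3) {
        PrintToConsoleStub(num); // known to return, so the UB below is
        *((char*)NULL) = 0;      // unavoidable once the branch is taken;
    }                            // the compiler may drop the whole branch
}

Here the compiler can see the callee all the way through, so the whole if body becomes a dead-branch candidate in the way described above.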
A similar example to your second is the gcc option -fdelete-null-pointer-checks, which can take code like this (I haven't checked this specific example, consider it illustrative of the general idea):
void foo(int *p) {
    if (p) *p = 3;
    std::cout << *p << '\n';
}
and change it to:
*p = 3;
std::cout << "3\n";
Why? Because if p is null then the code has UB anyway, so the compiler may assume it is not null and optimize accordingly. The Linux kernel tripped over this (https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-1897) essentially because it operates in a mode where dereferencing a null pointer isn't supposed to be UB; it's expected to result in a defined hardware exception that the kernel can handle. When optimization is enabled, gcc requires the use of -fno-delete-null-pointer-checks in order to provide that beyond-standard guarantee.
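As a usage sketch (the file name is illustrative), a build wanting that beyond-standard guarantee passes the flag alongside optimization:

gcc -O2 -fno-delete-null-pointer-checks null_tolerant.c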
P.S. The practical answer to the question "when does undefined behavior strike?" is "10 minutes before you were planning to leave for the day".
Answer 2:
The standard states at 1.9/4
[ Note: This International Standard imposes no requirements on the
behavior of programs that contain undefined behavior. — end note ]
The interesting point is probably what "contain" means. A little later at 1.9/5 it states:
However, if any such execution contains an undefined operation, this
International Standard places no requirement on the implementation
executing that program with that input (not even with regard to
operations preceding the first undefined operation)
Here it specifically mentions "execution ... with that input". I would interpret that as, undefined behaviour in one possible branch which is not executed right now does not influence the current branch of execution.
A different issue, however, is assumptions based on undefined behaviour during code generation. See Steve Jessop's answer for more details about that.
Answer 3:
An instructive example is
int foo(int x)
{
    int a;
    if (x)
        return a;
    return 0;
}
Both current GCC and current Clang will optimize this (on x86) to
xorl %eax,%eax
ret
because they deduce that x is always zero from the UB in the if (x) control path. GCC won't even give you a use-of-uninitialized-value warning! (because the pass that applies the above logic runs before the pass that generates uninitialized-value warnings)
Answer 4:
The current C++ working draft says in 1.9/4 that
This International Standard imposes no requirements on the behavior of programs that contain undefined behavior.
Based on this, I would say that a program containing undefined behavior on any execution path can do anything at any point during its execution.
There are two really good articles on undefined behavior and what compilers usually do:
- A Guide to Undefined Behavior in C and C++
- What Every C Programmer Should Know About Undefined Behavior
Answer 5:
The word "behavior" means something is being done. A statement that is never executed is not "behavior".
An illustration:
*ptr = 0;
Is that undefined behavior? Suppose we are 100% certain ptr == nullptr at least once during program execution. The answer should be yes.
What about this?
if (ptr) *ptr = 0;
Is that undefined? (Remember ptr == nullptr at least once?) I sure hope not, otherwise you won't be able to write any useful program at all.
No standardese was harmed in the making of this answer.
Answer 6:
The undefined behavior strikes when the program will cause undefined behavior no matter what happens next. However, you gave the following example.
int num = ReadNumberFromConsole();
if (num == 3) {
    PrintToConsole(num);
    *((char*)NULL) = 0; // undefined behavior
}
Unless the compiler knows the definition of PrintToConsole, it cannot remove the if (num == 3) conditional. Let's assume that you have a LongAndCamelCaseStdio.h system header with the following declaration of PrintToConsole.
void PrintToConsole(int);
Nothing too helpful, all right. Now, let's see how evil (or perhaps not so evil, undefined behavior could have been worse) the vendor is, by checking the actual definition of this function.
int printf(const char *, ...);
void exit(int);

void PrintToConsole(int num) {
    printf("%d\n", num);
    exit(0);
}
The compiler has to assume that any arbitrary function whose definition it doesn't know may exit or throw an exception (in the case of C++). You can notice that *((char*)NULL) = 0; won't be executed, as execution won't continue after the PrintToConsole call.
The undefined behavior strikes when PrintToConsole actually returns. The compiler expects this not to happen (as this would cause the program to execute undefined behavior no matter what), therefore anything can happen.
However, let's consider something else. Let's say we are doing a null check, and use the variable after the null check.
int putchar(int);
const char *warning;

void lol_null_check(const char *pointer) {
    if (!pointer) {
        warning = "pointer is null";
    }
    putchar(*pointer);
}
In this case, it's easy to notice that lol_null_check requires a non-NULL pointer. Assigning to the global non-volatile warning variable is not something that could exit the program or throw an exception. The pointer is also non-volatile, so it cannot magically change its value in the middle of the function (if it does, that's undefined behavior). Calling lol_null_check(NULL) will cause undefined behavior, which may cause the variable not to be assigned (because at this point, the fact that the program executes undefined behavior is known).
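A hedged sketch of one outcome the optimizer may legally produce (illustrative, not actual compiler output; it reuses the declarations above):

void lol_null_check(const char *pointer) {
    // The unconditional dereference below requires pointer != NULL, so
    // the compiler may treat the null branch as dead and delete the
    // assignment to warning entirely.
    putchar(*pointer);
}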
However, undefined behavior means the program can do anything. Therefore, nothing stops the undefined behavior from going back in time and crashing your program before the first line of int main() executes. It's undefined behavior; it doesn't have to make sense. It may as well crash after you type 3, but the undefined behavior will go back in time and crash before you even type 3. And who knows, perhaps the undefined behavior will overwrite your system RAM and cause your system to crash 2 weeks later, while your undefined program is not running.
Answer 7:
If the program reaches a statement that invokes undefined behavior, no requirements are placed on any of the program's output/behavior whatsoever; it doesn't matter whether they would take place "before" or "after" undefined behavior is invoked.
Your reasoning about all three code snippets is correct. In particular, a compiler may treat any statement which unconditionally invokes undefined behavior the way GCC treats __builtin_unreachable(): as an optimization hint that the statement is unreachable (and thereby, that all code paths leading unconditionally to it are also unreachable). Other similar optimizations are of course possible.
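A hedged sketch of that equivalence (the function names are invented; __builtin_unreachable() is a GCC/Clang builtin): under the as-if rule, a compiler may emit identical code for both of these functions:

#include <cstddef> // for NULL

int always_positive(int x) {
    if (x <= 0)
        *((char*)NULL) = 0;      // unconditional UB on this path
    return x;
}

int always_positive_hint(int x) {
    if (x <= 0)
        __builtin_unreachable(); // explicit promise this path never runs
    return x;
}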
Answer 8:
Many standards for many kinds of things expend a lot of effort on describing things which implementations SHOULD or SHOULD NOT do, using nomenclature similar to that defined in IETF RFC 2119 (though not necessarily citing the definitions in that document). In many cases, descriptions of things that implementations should do except in cases where they would be useless or impractical are more important than the requirements to which all conforming implementations must conform.
Unfortunately, C and C++ Standards tend to eschew descriptions of things which, while not 100% required, should nonetheless be expected of quality implementations which don't document contrary behavior. A suggestion that implementations should do something might be seen as implying that those which don't are inferior, and in cases where it would generally be obvious which behaviors would be useful or practical, versus impractical and useless, on a given implementation, there was little perceived need for the Standard to interfere with such judgments.
A clever compiler could conform to the Standard while eliminating any code that would have no effect except when code receives inputs that would inevitably cause Undefined Behavior, but "clever" and "dumb" are not antonyms. The fact that the authors of the Standard decided that there might be some kinds of implementations where behaving usefully in a given situation would be useless and impractical does not imply any judgment as to whether such behaviors should be considered practical and useful on others.

If an implementation could uphold a behavioral guarantee for no cost beyond the loss of a "dead-branch" pruning opportunity, almost any value user code could receive from that guarantee would exceed the cost of providing it. Dead-branch elimination may be fine in cases where it wouldn't require giving up anything, but if in a given situation user code could have handled almost any possible behavior other than dead-branch elimination, any effort user code would have to expend to avoid UB would likely exceed the value achieved from DBE.