Question:
I've been searching for a while, but can't find a clear answer.
Lots of people say that using unions to type-pun is undefined and bad practice. Why is this? I can't see any reason why it would do anything undefined, considering the memory you write the original information to isn't going to just change of its own accord (unless it goes out of scope on the stack, but that's not a union issue, that would be bad design).
People quote the strict aliasing rule, but that seems to me to be like saying you can't do it because you can't do it.
Also, what is the point of a union if not to type-pun? I saw somewhere that they are supposed to be used to use the same memory location for different information at different times, but why not just delete the info before using it again?
To summarise:
- Why is it bad to use unions for type punning?
- What is the point of them if not this?
Extra information: I'm using mainly C++, but would like to know about that and C. Specifically, I'm using unions to convert between floats and the raw hex to send via CAN bus.
Answer 1:
To re-iterate, type-punning through unions is perfectly fine in C (but not in C++). In contrast, using pointer casts to do so violates C99 strict aliasing and is problematic because different types may have different alignment requirements and you could raise a SIGBUS if you do it wrong. With unions, this is never a problem.
The relevant quotes from the C standards are:
C89 section 3.3.2.3 §5:
if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined
C11 section 6.5.2.3 §3:
A postfix expression followed by the . operator and an identifier designates a member of a structure or union object. The value is that of the named member
with the following footnote 95:
If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
This should be perfectly clear.
James is confused because C11 section 6.7.2.1 §16 reads
The value of at most one of the members can be stored in a union object at any time.
This seems contradictory, but it is not: In contrast to C++, in C, there is no concept of active member and it's perfectly fine to access the single stored value through an expression of an incompatible type.
See also C11 annex J.1 §1:
The values of bytes that correspond to union members other than the one last stored into [are unspecified].
In C99, this used to read
The value of a union member other than the last one stored into [is unspecified]
This was incorrect. As the annex isn't normative, it did not rate its own TC and had to wait until the next standard revision to get fixed.
GNU extensions to standard C++ (and to C90) do explicitly allow type-punning with unions. Other compilers that don't support GNU extensions may also support union type-punning, but it's not part of the base language standard.
Answer 2:
The original purpose of unions was to save space when you want to be able to represent different types, what we call a variant type; see Boost.Variant as a good example of this.
The other common use is type punning. The validity of this is debated, but in practice most compilers support it; we can see that gcc documents its support:
The practice of reading from a different union member than the one most recently written to (called “type-punning”) is common. Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type. So, the code above works as expected.
Note that it says "even with -fstrict-aliasing, type-punning is allowed", which indicates there is an aliasing issue at play.
Pascal Cuoq has argued that defect report 283 clarified this was allowed in C. Defect report 283 added the following footnote as clarification:
If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
In C11 that would be footnote 95.
However, in the std-discussion mail group topic Type Punning via a Union the argument is made that this is underspecified, which seems reasonable since DR 283 did not add new normative wording, just a footnote:
This is, in my opinion, an underspecified semantic quagmire in C.
Consensus has not been reached between implementors and the C
committee as to exactly which cases have defined behavior and which do
not[...]
In C++ it is unclear whether it is defined behavior or not.
This discussion also covers at least one reason why allowing type punning through a union is undesirable:
[...]the C standard's rules break the type-based alias
analysis optimizations which current implementations perform.
That is, it breaks some optimizations. The second argument against it is that using memcpy should generate identical code, does not break optimizations, and is well-defined behavior. For example, this:
std::int64_t n;
std::memcpy(&n, &d, sizeof d);
instead of this:
union u1
{
    std::int64_t n;
    double d;
};
u1 u;
u.d = d;
std::int64_t n = u.n;  // the pun itself: read the member that was not written
We can see using godbolt that this does generate identical code, and the argument is made that if your compiler does not generate identical code, it should be considered a bug:
If this is true for your implementation, I suggest you file a bug on it. Breaking real optimizations (anything based on type-based alias analysis) in order to work around performance issues with some particular compiler seems like a bad idea to me.
The blog post Type Punning, Strict Aliasing, and Optimization also comes to a similar conclusion.
The undefined behavior mailing list discussion: Type punning to avoid copying covers a lot of the same ground and we can see how grey the territory can be.
Answer 3:
It's legal in C99:
From the standard:
6.5.2.3 Structure and union members
If the member used to access the contents of a union object is not the
same as the member last used to store a value in the object, the
appropriate part of the object representation of the value is
reinterpreted as an object representation in the new type as described
in 6.2.6 (a process sometimes called "type punning"). This might be a
trap representation.
Answer 4:
BRIEF ANSWER: Type punning can be safe in a few circumstances. On the other hand, although it seems to be a very well-known practice, the standard does not seem very interested in making it official.
I will talk only about C (not C++).
1. TYPE PUNNING and THE STANDARDS
As folks have already pointed out, type punning is allowed in standard C99 and also C11, in subsection 6.5.2.3. However, I will restate the facts with my own perception of the issue:
- The section 6.5 of standard documents C99 and C11 develop the topic of expressions.
- The subsection 6.5.2 is referred to postfix expressions.
- The subsubsection 6.5.2.3 talks about structs and unions.
- The paragraph 6.5.2.3(3) explains the dot operator applied to a struct or union object, and which value will be obtained.
Just there, the footnote 95 appears. This footnote says:
If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
The fact that type punning barely appears, and only in a footnote, gives a clue that it's not a central issue in C programming.
Actually, the main purpose of unions is to save space (in memory). Since several members share the same address, if one knows that each member will be used in different parts of the program, never at the same time, then a union can be used instead of a struct to save memory.
- The subsection 6.2.6 is mentioned.
- The subsection 6.2.6 talks about how objects are represented (in memory, say).
2. REPRESENTATION OF TYPES and ITS TROUBLE
If you pay attention to the different aspects of the standard, you can be sure of almost nothing:
- The representation of pointers is not clearly specified.
- Worse, pointers having different types could have different representations (as objects in memory).
- union members share the same starting address in memory, and it's the same address as that of the union object itself.
- struct members have increasing relative addresses, starting at exactly the same memory address as that of the struct object itself. However, padding bytes can be added after every member. How many? It's unpredictable. Padding bytes are used mainly for memory alignment purposes.
- Arithmetic types (integers, floating-point real and complex numbers) could be representable in a number of ways. It depends on the implementation.
- In particular, integer types could have padding bits. This is not true, I believe, for desktop computers. However, the standard left the door open for this possibility. Padding bits are used for special purposes (parity, signals, who knows), and not for holding mathematical values.
- signed types can be represented in 3 ways: 1's complement, 2's complement, or sign and magnitude.
- The char types occupy just 1 byte, but 1 byte can have a number of bits different from 8 (but never fewer than 8).
However we can be sure about some details:
a. The char types have no padding bits.
b. The unsigned integer types are represented exactly in pure binary form.
c. unsigned char occupies exactly 1 byte, without padding bits, and there is no trap representation because all the bits are used. Moreover, it represents a value without any ambiguity, following the binary format for integer numbers.
3. TYPE PUNNING vs TYPE REPRESENTATION
All these observations reveal that, if we try to do type punning with union members having types other than unsigned char, we could have a lot of ambiguity. It's not portable code and, in particular, our program could behave unpredictably.
However, the standard allows this kind of access.
Even if we are sure about the specific manner in which every type is represented in our implementation, we could have a sequence of bits meaning nothing at all in other types (trap representation). We cannot do anything in this case.
4. THE SAFE CASE: unsigned char
The only safe manner of using type punning is with unsigned char or unsigned char arrays (because we know that members of array objects are strictly contiguous and there are no padding bytes when their size is computed with sizeof()).
union {
    TYPE data;
    unsigned char type_punning[sizeof(TYPE)];
} xx;
Since we know that unsigned char is represented in strict binary form, without padding bits, type punning can be used here to take a look at the binary representation of the member data.
This tool can be used to analyze how values of a given type are represented, in a particular implementation.
I am not able to see another safe and useful application of type punning under the standard specifications.
5. A COMMENT ABOUT CASTS...
If one wants to play with types, it's better to define your own transformation functions, or simply use casts. Consider this simple example:
union {
    unsigned char x;
    double t;
} uu;
bool result;
uu.x = 7;
result = (uu.t == 7.0);  // type punning: the byte 7 is read as part of a double
// You can bet that result == false
uu.t = (double)(uu.x);
result = (uu.t == 7.0);  // value conversion through a cast
// result == true
Answer 5:
There are (or at least were, back in C90) two motivations for
making this undefined behavior. The first was that a compiler
would be allowed to generate extra code which tracked what was
in the union, and generated a signal when you accessed the wrong
member. In practice, I don't think anyone ever did (maybe
CenterLine?). The other was the optimization possibilities this
opened up, and these are used. I have used compilers which
would defer a write until the last possible moment, on the
grounds that it might not be necessary (because the variable
goes out of scope, or there is a subsequent write of a different
value). Logically, one would expect that this optimization
would be turned off when the union was visible, but it wasn't in
the earliest versions of Microsoft C.
The issues of type punning are complex. The C committee (back
in the late 1980s) more or less took the position that you
should use casts (in C++, reinterpret_cast) for this, and not
unions, although both techniques were widespread at the time.
Since then, some compilers (g++, for example) have taken the
opposite point of view, supporting the use of unions, but not
the use of casts. And in practice, neither works if it is not
immediately obvious that there is type-punning. This might be
the motivation behind g++'s point of view. If you access
a union member, it is immediately obvious that there might be
type-punning. But of course, something like:
int f(const int* pi, double* pd)
{
    int results = *pi;
    *pd = 3.14159;
    return results;
}
called with:
union U { int i; double d; };
U u;
u.i = 1;
std::cout << f( &u.i, &u.d );
is perfectly legal according to the strict rules of the
standard, but fails with g++ (and probably many other
compilers); when compiling f, the compiler assumes that pi
and pd can't alias, and reorders the write to *pd and the
read from *pi. (I believe that it was never the intent that
this be guaranteed. But the current wording of the standard
does guarantee it.)
EDIT:
Since other answers have argued that the behavior is in fact
defined (largely based on quoting a non-normative note, taken
out of context):
The correct answer here is that of pablo1977: the standard makes
no attempt to define the behavior when type punning is involved.
The probable reason for this is that there is no portable
behavior that it could define. This does not prevent a specific
implementation from defining it; although I don't remember any
specific discussions of the issue, I'm pretty sure that the
intent was that implementations define something (and most, if
not all, do).
With regards to using a union for type-punning: when the
C committee was developing C90 (in the late 1980s), there was
a clear intent to allow debugging implementations which did
additional checking (such as using fat pointers for bounds
checking). From discussions at the time, it was clear that the
intent was that a debugging implementation might cache
information concerning the last value initialized in a union,
and trap if you tried to access anything else. This is clearly
stated in §6.7.2.1/16: "The value of at most one of the members
can be stored in a union object at any time." Accessing a value
that isn't there is undefined behavior; it can be likened to
accessing an uninitialized variable. (There were some
discussions at the time as to whether accessing a different
member with the same type was legal or not. I don't know what
the final resolution was, however; after around 1990, I moved on
to C++.)
With regards to the quote from C89, saying the behavior is
implementation-defined: finding it in section 3 (Terms,
Definitions and Symbols) seems very strange. I'll have to look
it up in my copy of C90 at home; the fact that it has been
removed in later versions of the standards suggests that its
presence was considered an error by the committee.
The use of unions which the standard supports is as a means to
simulate derivation. You can define:
struct NodeBase
{
    enum NodeType type;
};
struct InnerNode
{
    enum NodeType type;
    struct NodeBase* left;
    struct NodeBase* right;
};
struct ConstantNode
{
    enum NodeType type;
    double value;
};
// ...
union Node
{
    struct NodeBase base;
    struct InnerNode inner;
    struct ConstantNode constant;
    // ...
};
and legally access base.type, even though the Node was
initialized through inner. (The fact that §6.5.2.3/6 starts
with "One special guarantee is made..." and goes on to
explicitly allow this is a very strong indication that all other
cases are meant to be undefined behavior. And of course, there
is the statement that "Undefined behavior is otherwise indicated
in this International Standard by the words 'undefined
behavior' or by the omission of any explicit definition of
behavior" in §4/2; in order to argue that the behavior is not
undefined, you have to show where it is defined in the standard.)
Finally, with regards to type-punning: all (or at least all that
I've used) implementations do support it in some way. My
impression at the time was that the intent was that pointer
casting be the way an implementation supported it; in the C++
standard, there is even (non-normative) text to suggest that the
results of a reinterpret_cast be "unsurprising" to someone
familiar with the underlying architecture. In practice,
however, most implementations support the use of union for
type-punning, provided the access is through a union member.
Most implementations (but not g++) also support pointer casts,
provided the pointer cast is clearly visible to the compiler
(for some unspecified definition of pointer cast). And the
"standardization" of the underlying hardware means that things
like:
int
getExponent( double d )
{
    // the stored exponent is biased by 1023, so subtract the bias
    return (int)((*(uint64_t*)(&d) >> 52) & 0x7FF) - 1023;
}
are actually fairly portable. (It won't work on mainframes, of
course.) What doesn't work are things like my first example,
where the aliasing is invisible to the compiler. (I'm pretty
sure that this is a defect in the standard. I seem to recall
even having seen a DR concerning it.)