Accessing C union members via pointers

Posted 2019-02-08 21:26

Question:

Does accessing union members via a pointer, as in the example below, result in undefined behavior in C99? The intent seems clear enough, but I know that there are some restrictions regarding aliasing and unions.

#include <stdio.h>

union { int i; char c; } u;

int main(void)
{
    int  *ip = &u.i;
    char *ic = &u.c;

    *ip = 0;
    *ic = 'a';
    printf("%c\n", u.c);
    return 0;
}

Answer 1:

It is unspecified (subtly different from undefined) behaviour to access a union by any element other than the one that was last written. That's detailed in C99 annex J:

The following are unspecified:
   :
   The value of a union member other than the last one stored into (6.2.6.1).

However, since you are writing to c via the pointer, then reading c, this particular example is well defined. It does not matter how you write to the element:

u.c = 'a';        // direct write.
*(&(u.c)) = 'a';  // variation on yours, writing through element pointer.
(&u)->c = 'a';    // writing through union pointer.

One issue raised in the comments seems, at first glance, to contradict that. User davmac provides sample code:

// Compile with "-O3 -std=c99" eg:
//  clang -O3 -std=c99 test.c
//  gcc -O3 -std=c99 test.c
// On clang v3.5.1, output is "123"
// On gcc 4.8.4, output is "1073741824"
//
// Different outputs, so either:
// * program invokes undefined behaviour; both compilers are correct OR
// * compiler vendors interpret standard differently OR
// * one compiler or the other has a bug

#include <stdio.h>

union u
{
    int i;
    float f;
};

int someFunc(union u * up, float *fp)
{
    up->i = 123;
    *fp = 2.0;     // does this set the union member?
    return up->i;  // then this should not return 123!
}

int main(int argc, char **argv)
{
    union u uobj;
    printf("%d\n", someFunc(&uobj, &uobj.f));
    return 0;
}

This outputs different values on different compilers. However, I believe that is because the code itself breaks the rules: it writes to member f and then reads member i, which, as shown in Annex J, is unspecified.

There is a footnote 82 in 6.5.2.3 which states:

If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type.

However, since this seems to go against the Annex J comment and it's a footnote to the section dealing with expressions of the form x.y, it may not apply to accesses via a pointer.

One of the major reasons why aliasing is supposed to be strict is to allow the compiler more scope for optimisation. To that end, the standard dictates that reading memory as a type different from the one it was written as is unspecified.

By way of example, consider the function provided:

int someFunc(union u * up, float *fp)
{
    up->i = 123;
    *fp = 2.0;     // does this set the union member?
    return up->i;  // then this should not return 123!
}

The implementation is free to assume that, because you're not supposed to alias memory, up->i and *fp are two distinct objects. It can therefore assume that the value of up->i does not change after you set it to 123, and simply return 123 without looking at the actual variable contents again.

If, instead, you changed the statement that writes through fp to:

up->f = 2.0;

then that would make footnote 82 applicable and the returned value would be a re-interpretation of the float as an integer.

The reason I don't think that's an issue for the question is that you're writing and then reading the same type, hence aliasing rules don't come into play.


It's interesting to note that the unspecified behaviour is caused not by the function itself, but by calling it thus:

union u up;
int x = someFunc (&up, &(up.f)); // <- aliasing here

If you were instead to call it so:

union u up;
float down;
int x = someFunc (&up, &down); // <- no aliasing

that would not be a problem.



Answer 2:

No, it won't, but you need to keep track of which type you last stored in the union. If I were to reverse the order of your int and char assignments, it would be a very different story:

#include <stdio.h>

union { int i; char c; } u;

int main()
{
    int  *ip = &u.i;
    char *ic = &u.c;

    *ic = 'a';
    *ip = 123456;

    printf("%c\n", u.c); /* trying to print a char even though 
                            it's currently storing an int,
                            in this case it prints '@' on my machine */

    return 0;
}

EDIT: Some explanation on why it may have printed 64 ('@').

The binary representation of 123456 is 0001 1110 0010 0100 0000.

For 64 it is 0100 0000.

You can see that the low-order 8 bits are identical. On a little-endian machine, u.c occupies the lowest-addressed byte of u.i, which holds exactly those bits, so printf prints the character with code 64.



Answer 3:

The only reason it's not UB is because you were lucky/unlucky enough to choose char for one of the types, and character types can alias anything in C. If the types were, for example, int and float, the accesses via pointers would be aliasing violations and thus undefined behavior. For direct access via the union, the behavior was deemed well defined as part of the interpretation for Defect Report 283:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm

Of course, you still need to ensure that the representation of the type used for writing can also be interpreted as a valid (non-trap) representation for the type later used for reading.