Strict aliasing and writing int via char*

Posted 2019-09-11 11:13

In an old program I serialized a data structure to bytes by allocating an array of unsigned char and then storing ints with:

*((int*)p) = value;

(where p is the unsigned char*, and value is the value to be stored).

This worked fine except when compiled on SPARC, where it triggered exceptions due to misaligned memory accesses. That made perfect sense: the data elements had varying sizes, so p quickly became unaligned and triggered the error when used to store an int value, because the underlying SPARC instructions require alignment.

This was quickly fixed (by writing the value to the char array byte by byte). But I'm a bit concerned, because I've used this construction in many programs over the years without issue. Clearly I'm violating some C rule (strict aliasing?), and whereas this case was easily discovered, maybe the violations can cause other, more subtle kinds of undefined behavior due to optimizing compilers etc. I'm also a bit puzzled, because I believe I've seen constructions like this in a lot of C code over the years. I'm thinking of hardware drivers that describe the data structures exchanged with the hardware as structs (using pack(1), of course) and write those to h/w registers etc. So it seems to be a common technique.

So my question is: exactly what rule was violated by the above, and what would be the proper C way to implement this use case (i.e. serializing data to an array of unsigned char)? Of course, custom serialization functions can be written for every type to write it out byte by byte, but that sounds cumbersome and not very efficient.

Finally, can ill effects (outside of alignment problems etc.) in general be expected through violation of this aliasing rule?

Tags: c aliasing
2 answers
你好瞎i
#2 · 2019-09-11 11:44

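Yes, your code violates the strict aliasing rule. In C, only char and its signed and unsigned counterparts may be used to access objects of other types. The reverse, accessing an object declared as an array of unsigned char through an int lvalue, is not covered by that exception.

So, the proper way to do such raw serialization is to create an array of ints, and then treat it as an unsigned char buffer.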

int arr[] = { 1, 2, 3, 4, 5 };
unsigned char* rawData = (unsigned char*)arr;

You can memcpy, fwrite, or do other serialization of rawData, and it is absolutely valid.

Deserialization code may look like this:

int* arr = (int*)calloc(5, sizeof(int));
memcpy(arr, rawData, 5 * sizeof(int));

Of course, you have to take care of endianness, padding and other issues to implement reliable serialization.
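As a rough illustration of the endianness point, here is a minimal sketch that writes a 32-bit value into an unsigned char buffer one byte at a time, in a fixed (little-endian) order. The helper names put_u32_le and get_u32_le are hypothetical, and the use of uint32_t assumes a platform that provides that type. Because only unsigned char accesses touch the buffer, neither alignment nor aliasing is an issue, whatever offset is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store v at buf[0..3], least significant byte first. */
void put_u32_le(unsigned char *buf, uint32_t v)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)(v & 0xFFu);
    buf[1] = (unsigned char)((v >> 8) & 0xFFu);
    buf[2] = (unsigned char)((v >> 16) & 0xFFu);
    buf[3] = (unsigned char)((v >> 24) & 0xFFu);
}

/* Hypothetical helper: read back a value written by put_u32_le. */
uint32_t get_u32_le(const unsigned char *buf)
{
    assert(buf != NULL);
    return (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}
```

The byte order is fixed by the shifts rather than by the host CPU, so the same bytes come out on big-endian and little-endian machines.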

在下西门庆
#3 · 2019-09-11 11:49

How a struct is represented (laid out) in memory, and whether the start address of a struct is aligned to a 1, 2, 4, 8, ... byte boundary, is compiler and platform specific. Therefore, you should not make any assumptions about the layout of your struct's members.

On platforms where member types require specific alignment, padding bytes are added to the struct (which matches the statement I made above, that sizeof(struct Foo) >= the sum of its data member sizes). The padding...

Now, if you fwrite() or memcpy() a struct from one instance to another on the same machine, with the same compiler and settings (e.g. within the same program), you write both the data content and the padding bytes added by the compiler. As long as you handle the whole struct, you can round-trip successfully (at least as long as there are no pointer members inside the struct).
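A small sketch of that whole-struct round trip, under the assumption of a made-up struct Foo with one char and one int member (the functions foo_padding_bytes and foo_round_trip are hypothetical names as well). How many padding bytes appear depends on the compiler and platform, so the sketch computes the number instead of assuming it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record used only for this illustration. */
struct Foo {
    char tag;   /* 1 byte */
    int  value; /* usually needs alignment, so padding may follow tag */
};

/* Bytes the compiler added on top of the members themselves. */
size_t foo_padding_bytes(void)
{
    return sizeof(struct Foo) - (sizeof(char) + sizeof(int));
}

/* Copy the whole object, padding included, through a byte buffer and back. */
struct Foo foo_round_trip(const struct Foo *src)
{
    unsigned char raw[sizeof(struct Foo)];
    struct Foo copy;

    assert(src != NULL);
    memcpy(raw, src, sizeof raw);    /* "serialize": members plus padding */
    memcpy(&copy, raw, sizeof copy); /* "deserialize" with the same layout */
    return copy;
}
```

This only works because both copies use the same layout; a reader built with different packing or alignment settings would interpret the bytes differently.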

What you cannot assume is that you can cast a pointer to a smaller type (e.g. unsigned char) to a pointer to a "larger" type (e.g. unsigned int) and access the memory through it, because unsigned int might require proper alignment on the target platform. If you get that wrong, you usually see bus errors or the like.
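One common way around the cast, sketched here with hypothetical helper names store_uint and load_uint, is to copy the bytes with memcpy between the buffer and a properly declared unsigned int. The unsigned int object itself is always correctly aligned, and the buffer is only touched as bytes, so the position inside the buffer does not matter. Note that this keeps the host's native byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write value at p, which may be unaligned. */
void store_uint(unsigned char *p, unsigned int value)
{
    assert(p != NULL);
    memcpy(p, &value, sizeof value);
}

/* Hypothetical helper: read an unsigned int from p, which may be unaligned. */
unsigned int load_uint(const unsigned char *p)
{
    unsigned int value;

    assert(p != NULL);
    memcpy(&value, p, sizeof value);
    return value;
}
```

Optimizing compilers typically turn such fixed-size memcpy calls into plain load and store instructions where the target allows it, so this is usually not slower than the cast.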

malloc() is the generic way to get heap memory for any type of data, be it a byte array or some struct, independent of its alignment requirements. There is no system where you cannot write struct Foo *ps = malloc(sizeof(struct Foo)). On platforms where alignment is vital, malloc will not return unaligned addresses, as that would break any code trying to allocate memory for a struct. As malloc() is not psychic, it also returns "struct-compatible" aligned pointers when you use it to allocate byte arrays.
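The following sketch illustrates that point with a made-up struct Packet: memory requested from malloc as a byte buffer is still suitably aligned to hold the struct. The names is_aligned and packet_from_byte_buffer are hypothetical, _Alignof assumes a C11 compiler, and checking alignment by converting the pointer to uintptr_t is an assumption that holds on common platforms but is not guaranteed by the standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record used only for this illustration. */
struct Packet {
    char tag;
    int  value;
};

/* Nonzero if p is a multiple of alignment (assumes a flat address space). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocate "just bytes" and hand them out as a struct Packet.
   Returns NULL on allocation failure; the caller must free() the result. */
struct Packet *packet_from_byte_buffer(void)
{
    unsigned char *bytes = malloc(sizeof(struct Packet));

    if (bytes == NULL) {
        return NULL;
    }
    return (struct Packet *)bytes; /* fine: malloc'd storage is suitably aligned */
}
```

The same cast applied to an arbitrary offset inside the buffer (for example bytes + 1) would lose that guarantee, which is exactly the situation in the question.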

Any form of "ad hoc" serialization, like writing out the whole struct, is a viable approach only as long as you do not need to exchange the serialized data with other machines or other applications (or future versions of the same application, where someone might have tinkered with alignment-related compiler settings).

If you are looking for a portable, more reliable and robust solution, you should consider using one of the mainstream serialization packages, such as the aforementioned Google protocol buffers.
