Is C endian-neutral?

Posted 2019-09-22 08:58

Is C endian-neutral? OK, another way of asking this question: I am currently translating a lot of code from C to Matlab on the same platform (PC). Do I need to care about endianness? Both are supposed to be endian-neutral languages, though I am not so sure about C and pretty sure about Matlab. By the same token, I am also translating C to Python. So my question is: has anybody, when translating from C to another endian-neutral language, met an unexpected problem with big/little endianness? Obviously we are only speaking about the core language, in this case C99.

1 answer
淡お忘
#2 · 2019-09-22 10:03

First, some background and clarification:

As I mentioned in a comment to the original question, byte order is often confused with bit order. Endianness refers to byte order only. Bit order is only relevant in documentation and when data is sent via some serial connection.

In arithmetic, in base B (and 2 ≤ B ∈ ℕ), the i'th digit $D_i$ has value $D_i B^i$. The least significant integral digit corresponds to i = 0, i.e. $D_0$. For binary, B = 2. For ordinary decimal numbers most humans prefer, B = 10.

(This works for all reals, not just integers. Most significant fractional digit, the first digit on the other side of the decimal point, is $D_{-1}$, with more negative i's indicating less significant digits.)

Because 'bit' is a portmanteau of 'binary digit', we thus have a natural way of labeling bits, with bit 0 referring to the least significant (integer) bit (corresponding to value 1), bit 1 referring to the next one in significance (corresponding to value 2), and so on.

Some documentation for hardware using big-endian byte order insists on labeling the most significant bit in a word as "bit 0" (with bit numbers increasing from left to right -- contrary to most numeric representations, where digits grow more significant from right to left). This is only a labeling convention; it does not follow the arithmetic rules above. In fact, you need to know the width (number of bits) of the word even to calculate the numeric value of such a "bit 0".
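To make the difference between the two labeling conventions concrete, here is a small Python sketch. The function names are made up for this illustration; the point is only that the arithmetic labeling needs no word width, while the "most significant bit is bit 0" labeling does.

```python
def bit_value_lsb0(i):
    """Value of bit i when bit 0 is the least significant bit."""
    return 1 << i

def bit_value_msb0(i, width):
    """Value of bit i when bit 0 is the most significant bit of a width-bit word."""
    return 1 << (width - 1 - i)

def value_from_digits(digits, base):
    """Sum of D_i * base**i, where digits[0] is the least significant digit."""
    return sum(d * base**i for i, d in enumerate(digits))

print(bit_value_lsb0(0))                    # 1, regardless of word width
print(bit_value_msb0(0, 8))                 # 128 in an 8-bit word
print(bit_value_msb0(0, 32))                # 2147483648 in a 32-bit word
print(value_from_digits([1, 0, 1, 1], 2))   # bits 0, 2 and 3 set: 1 + 4 + 8 = 13
```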

Is C endian-neutral?

Yes, C (as in ISO C89, C99, and C11) is neutral with regard to byte order. The standards do not define any byte order; it is up to the implementation to decide. In practice, the compiler uses the byte order suitable for the target architecture at compile time.

In theory, integer and floating-point types may very well have different byte order.
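If you want to see what the implementation you are running on actually does, the following Python sketch looks at the in-memory bytes of a C unsigned int and a C double (via the array module) and compares them against known big-endian and little-endian encodings. The helper name classify is mine, and the sketch assumes the double is an 8-byte IEEE 754 value, which is true on ordinary PCs but not guaranteed by C.

```python
import struct
import sys
from array import array

def classify(raw, big, little):
    """Name the byte order of raw, given its big- and little-endian encodings."""
    if raw == big:
        return "big"
    if raw == little:
        return "little"
    return "other"

# In-memory bytes of a C unsigned int holding 0x0102.
int_cell = array("I", [0x0102])
width = int_cell.itemsize
int_order = classify(int_cell.tobytes(),
                     (0x0102).to_bytes(width, "big"),
                     (0x0102).to_bytes(width, "little"))

# In-memory bytes of a C double holding 1.0 (assumed IEEE 754 binary64).
float_order = classify(array("d", [1.0]).tobytes(),
                       struct.pack(">d", 1.0),
                       struct.pack("<d", 1.0))

print(sys.byteorder, int_order, float_order)
```

On a typical PC all three printed values should be the same; a mismatch between the last two would be exactly the theoretical case mentioned above.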

POSIX.1 adds networking support to C. Certain fields in network-related structures are defined to be in network byte order, most significant byte first. POSIX.1 provides the htons(), htonl(), ntohs(), and ntohl() byte-order functions to convert from host to network byte order and vice versa.
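Python's socket module exposes the same four functions, so their effect can be sketched without writing any C. The port and address values below are arbitrary examples. Whatever the host byte order is, the native in-memory bytes of the converted value come out most significant byte first.

```python
import socket
import struct

port = 8080                                   # 0x1F90, arbitrary example
wire_value = socket.htons(port)               # host -> network order, 16 bits
wire_bytes = struct.pack("=H", wire_value)    # native in-memory bytes of the result
print(wire_bytes.hex())                       # 1f90 on both little- and big-endian hosts
print(socket.ntohs(wire_value))               # 8080 again

addr = 0xC0A80001                             # 192.168.0.1 as a 32-bit number
addr_bytes = struct.pack("=I", socket.htonl(addr))
print(list(addr_bytes))                       # [192, 168, 0, 1]
```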

In addition to network byte order (which is often called big-endian), little-endian byte order (least significant byte first) is also very common, for example on Intel/AMD architectures. The PDP-endian byte order (where four-byte values are stored second-most significant byte first, followed by the most significant byte, followed by the least significant, followed by the second-least significant byte) is nowadays rare.
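As an illustration, here is the same four-byte value laid out in each of the three byte orders. This is a Python sketch; pdp_bytes is a helper written for this example, built directly from the description above (two 16-bit halves, most significant half first, each half stored least significant byte first).

```python
value = 0x0A0B0C0D

big = value.to_bytes(4, "big")         # network byte order
little = value.to_bytes(4, "little")   # e.g. Intel/AMD

def pdp_bytes(v):
    """PDP-endian layout of a 32-bit value."""
    high_half, low_half = v >> 16, v & 0xFFFF
    return high_half.to_bytes(2, "little") + low_half.to_bytes(2, "little")

print(big.hex())               # 0a0b0c0d
print(little.hex())            # 0d0c0b0a
print(pdp_bytes(value).hex())  # 0b0a0d0c
```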

Finally, C has been implemented on a large number of architectures, with byte orders covering all three mentioned above, without any byte order issues. That should be practical proof enough.

I am currently translating a lot of code from C to Matlab [or Python] on the same platform (PC). Do I need to care about endianness?

No, I don't see any reason for you to care about endianness when porting code between C, Matlab, Python, or just about any high-level language.

However:

A language being endian-neutral does not mean you don't need to care about endianness in your programs. Data byte order matters. It boils down to how your programs transfer -- read and write -- data; be that via in-memory structures (using shared memory, or between different programming languages via library bindings), to/from files, via network connections, or via pipes from/to other programs.

If your programs transfer data in some text-based format, then all you need to worry about is that format, and possibly the character set used -- I prefer UTF-8 (see utf8everywhere.org).

If your programs transfer data in binary, then you must understand that in binary, multi-byte values always have some specific byte order. It can be network byte order (or big-endian), little-endian, or native byte order for the current architecture. Just because your programming language is endian-neutral does not mean you get to ignore the storage byte order.

For example, Matlab and Octave fread() support a fifth parameter that specifies the byte order used: native, ieee-be (IEEE big-endian), or ieee-le (IEEE little-endian). Python struct module pack and unpack functions default to native byte order and C alignment (padding), but you can use < or > as the first character in the format string to indicate little-endian or big-endian/network-endian byte order with no padding.
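For the Python side, a short sketch of what those prefixes do. The value 0x11223344 and the "bi" format (a char followed by an int) are arbitrary examples. The size of the native format depends on the platform's alignment rules, so I print it without claiming a particular result.

```python
import struct

value = 0x11223344

le = struct.pack("<I", value)    # little-endian, standard size, no padding
be = struct.pack(">I", value)    # big-endian, standard size, no padding
net = struct.pack("!I", value)   # network byte order, same bytes as ">"

print(le.hex())    # 44332211
print(be.hex())    # 11223344
print(net.hex())   # 11223344

# Reading little-endian bytes back gives the same number on any host.
(decoded,) = struct.unpack("<I", bytes.fromhex("44332211"))
print(hex(decoded))              # 0x11223344

packed_size = struct.calcsize("<bi")   # 5: one char plus a 4-byte int, no padding
native_size = struct.calcsize("bi")    # native mode may insert padding after the char
print(packed_size, native_size)
```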

It is very common for C code to store binary data in native byte order. However, some C code does not. I prefer to store in native byte order, but also store known prototype values for each different basic numeric type, so that readers can trivially detect if they need to permute the byte order to interpret the data correctly. There are also various libraries and formats like NetCDF that may be utilized for creating portable binary data files.
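A minimal sketch of the prototype-value idea, written in Python for brevity. Everything here is hypothetical: the header layout, the constants PROTO_U32 and PROTO_F64, and the functions write_blob, detect_order and read_blob are invented for this example. It also simplifies by assuming integers and doubles share one byte order, and only tries big- and little-endian.

```python
import struct

PROTO_U32 = 0x01020304   # hypothetical prototype for 32-bit unsigned integers
PROTO_F64 = 1.5          # hypothetical prototype for doubles

def write_blob(values):
    """Writer: prototypes first, then the data, all in native byte order."""
    header = struct.pack("=Id", PROTO_U32, PROTO_F64)
    body = struct.pack("=%dI" % len(values), *values)
    return header + body

def detect_order(blob):
    """Reader: find the struct prefix under which the prototypes decode correctly."""
    for prefix in ("<", ">"):
        u32, f64 = struct.unpack_from(prefix + "Id", blob, 0)
        if u32 == PROTO_U32 and f64 == PROTO_F64:
            return prefix
    raise ValueError("unrecognized byte order")

def read_blob(blob):
    """Reader: decode the data using the detected byte order."""
    prefix = detect_order(blob)
    offset = struct.calcsize(prefix + "Id")
    count = (len(blob) - offset) // 4
    return list(struct.unpack_from("%s%dI" % (prefix, count), blob, offset))

native_blob = write_blob([1, 2, 3])
foreign_blob = struct.pack(">Id3I", PROTO_U32, PROTO_F64, 1, 2, 3)  # as if written on a big-endian machine
print(read_blob(native_blob))    # [1, 2, 3]
print(read_blob(foreign_blob))   # [1, 2, 3]
```

The reader never needs to know which machine wrote the file; it only checks which interpretation reproduces the known values.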

The most important thing is to understand what the C code does, first.

I don't see why someone would want to port code from C to Matlab or Python, unless the C code was really poor to begin with -- in which case I'd just rewrite the logic, not port the existing code.

Have you met an unexpected problem with big/little endianness?

No, never when porting code between high-level languages.

Yes, when storing/retrieving binary data between different systems.

While not related to endianness, for multi-dimensional data, it is important to remember that Fortran and Matlab (and OpenGL matrices) use column-major order (each column being consecutive in memory), while C uses row-major order (each row being consecutive in memory).
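The difference is easy to see by flattening a small matrix both ways. This is a plain-Python sketch; the index helpers are named for this example only, and the 2×3 matrix is arbitrary.

```python
def row_major_index(i, j, nrows, ncols):
    """Linear offset of element (i, j) when each row is consecutive (C)."""
    return i * ncols + j

def col_major_index(i, j, nrows, ncols):
    """Linear offset of element (i, j) when each column is consecutive (Fortran, Matlab)."""
    return j * nrows + i

matrix = [[1, 2, 3],
          [4, 5, 6]]
nrows, ncols = 2, 3

row_major = [0] * (nrows * ncols)
col_major = [0] * (nrows * ncols)
for i in range(nrows):
    for j in range(ncols):
        row_major[row_major_index(i, j, nrows, ncols)] = matrix[i][j]
        col_major[col_major_index(i, j, nrows, ncols)] = matrix[i][j]

print(row_major)   # [1, 2, 3, 4, 5, 6]
print(col_major)   # [1, 4, 2, 5, 3, 6]
```

So a raw array dumped by C code and read by Matlab with the same dimensions will come out transposed unless one side accounts for the layout.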
