Understanding undefined behavior for a binary stre

Posted 2020-02-13 08:04

Question:

The C spec has an interesting footnote (#268 C11dr §7.21.3 9)

"Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state."

Does this ever apply to binary streams reading a file? (as from a physical device)

IMO, a binary file on a disk is just a sea of bytes. It seems to me that a binary file could not have state-dependent encoding as it is a binary file. I'm fuzzy on the concept of "binary wide-oriented streams" and if that even could apply to disk I/O.

I see that calling fseek(file, 0, SEEK_END) on a serial stream like a COM port or maybe stdin may not reach the true end, as the end is yet to be determined. Hence the narrowing of the question to physical files.


[edit] Answer: A concern with older systems (maybe up to the late 1980s). As of 2014, on Windows, POSIX-specific systems, and other non-exotic platforms: not a problem.

@Shafik Yaghmour provides a good reference in Using fseek and ftell to determine the size of a file has a vulnerability?. There, @Jerry Coffin discusses CP/M, where binary files did not always have a precise length (128-byte records per wiki).

Thanks to @Keith Thompson's answer for the meat of the answer.

Together this explains the spec's "(because of possible trailing null characters)" comment.

Answer 1:

Binary files are going to be sequences of 8-bit bytes, with an exact specified size, on any system you're likely to use. But not all systems store files that way, and the C standard is carefully designed to allow portability to systems with unusual characteristics.

For example, a conforming C implementation might run on an operating system that stores files as sequences of 512-byte blocks, with no indication of how many bytes of the final block are significant. On such a system, when a binary file is created, the OS might pad the remainder of the final block with zero bytes. When you read from such a file, the padding bytes might either appear in the input (even though they were never explicitly written to the file), or they might be ignored (even though the program that created the file might have written them explicitly).
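To make the arithmetic concrete, here is a small sketch of how large a file would appear on such a block-oriented system. The helper name padded_size is hypothetical and the 512-byte block size is just the figure from the example above; no real operating system's API is being modeled.

```c
#include <stddef.h>

/* Hypothetical helper: the apparent size of a file holding `written` bytes
 * on an imagined system that stores whole blocks of `block` bytes and
 * zero-pads the final block. Assumes block > 0. */
size_t padded_size(size_t written, size_t block)
{
    size_t blocks = (written + block - 1) / block;  /* round up to whole blocks */
    return blocks * block;
}
```

Under these assumptions, a program that wrote 700 bytes would later see 1024 bytes when reading the file back, and the last 324 of them would be zero bytes it never wrote.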

If you're reading from a non-seekable stream (for example keyboard input), then fseek(file, 0, SEEK_END) won't just give you a bad result, it will indicate failure by returning a non-zero result. (On POSIX-compliant systems, it returns -1 and sets errno; ISO C doesn't require that.)
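A minimal sketch of checking for that failure, assuming a stream opened in binary mode. The function name size_by_seek is made up for illustration. It only reports what fseek and ftell say, so on an implementation that pads binary files the result could still include padding.

```c
#include <stdio.h>

/* Hypothetical helper (not from the original answer).
 * Returns the end-of-file offset of a binary stream, or -1L if the stream
 * cannot be positioned (for example, a non-seekable stream).
 * On success the original file position is restored. */
long size_by_seek(FILE *fp)
{
    long start = ftell(fp);
    if (start < 0) {
        return -1L;                      /* position cannot be queried */
    }
    if (fseek(fp, 0L, SEEK_END) != 0) {
        return -1L;                      /* non-zero result means failure */
    }
    long end = ftell(fp);                /* -1L here also signals failure */
    if (fseek(fp, start, SEEK_SET) != 0) {
        return -1L;                      /* could not restore the position */
    }
    return end;
}
```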

On most systems, fseek(file, 0, SEEK_END) on a binary file will either seek to the actual end of the file (a position determined by exactly how many bytes were written to the file), or return a clear failure indication. If you're using POSIX-specific features anyway, you can safely assume this behavior; you can probably make the same assumption for Windows and a number of other systems. If you want your code to be 100% portable to exotic systems, you shouldn't assume that binary files won't be padded with extra zero bytes.
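If avoiding the seek-to-end call altogether matters, one possible alternative is to count the bytes by reading them. This is only a sketch with a hypothetical function name (count_bytes). It sidesteps the behavior the footnote warns about, but it cannot tell real data from padding: on a system that pads binary files, the padding bytes the implementation delivers are counted too.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes readable from the current position
 * to end-of-file using only fread. Returns -1L on a read error.
 * Note: the total is a long, so extremely large files could overflow it. */
long count_bytes(FILE *fp)
{
    unsigned char buf[4096];
    long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        total += (long)n;
    }
    return ferror(fp) ? -1L : total;
}
```

The trade-off is cost: this reads the whole file, whereas seeking to the end is typically a constant-time operation on the systems where it works.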