Is it possible to confuse EOF with a normal byte v

Posted 2019-01-20 11:37

We often use fgetc like this:

int c;
while ((c = fgetc(file)) != EOF)
{
    // do stuff
}

Theoretically, if a byte in the file has the value of EOF, this code is buggy - it will break the loop early and fail to process the whole file. Is this situation possible?

As far as I understand, fgetc internally casts a byte read from the file to unsigned char and then to int, and returns it. This will work if the range of int is greater than that of unsigned char.

What happens if it's not (probably then sizeof(int)=1)?

  • Will fgetc sometimes read legitimate data equal to EOF from a file?
  • Will it alter the data it read from the file to avoid the single value EOF?
  • Will fgetc be an unimplemented function?
  • Will EOF be of another type, like long?

I could make my code fool-proof by an extra check:

int c;
for (;;)
{
    c = fgetc(file);
    if (feof(file))
        break;
    // do stuff
}

Is it necessary if I want maximum portability?

3 answers
Ridiculous、
#2 · 2019-01-20 12:06

The C specification says that int must be able to hold values from -32767 to 32767 at a minimum. Any platform with a smaller int is nonstandard.

The C specification also says that EOF is a negative int constant and that fgetc returns "an unsigned char converted to an int" in the event of a successful read. Since unsigned char can't have a negative value, the value of EOF can be distinguished from anything read from the stream.*

*See below for a loophole case in which this fails to hold.


Relevant standard text (from C99):

  • §5.2.4.2.1 Sizes of integer types <limits.h>:

    [The] implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.

    [...]

    • minimum value for an object of type int

      INT_MIN -32767

    • maximum value for an object of type int

      INT_MAX +32767

  • §7.19.1 <stdio.h> - Introduction

    EOF ... expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream

  • §7.19.7.1 The fgetc function

    If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined)

If UCHAR_MAX <= INT_MAX, there is no problem: all unsigned char values will be converted to non-negative integers, so they will be distinct from EOF.

Now, there is a funny sort of loophole here: if a system has UCHAR_MAX > INT_MAX, then a system is legally allowed to convert values greater than INT_MAX to negative integers (per §6.3.1.3, the result of converting a value to a signed type that cannot represent that value is implementation defined), making it possible for a character read from a stream to be converted to EOF.

Systems with CHAR_BIT > 8 do exist (e.g. the TI C4x DSP, which apparently uses 32-bit bytes), although I'm not sure if they are broken with respect to EOF and stream functions.
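
The UCHAR_MAX versus INT_MAX condition discussed above can be checked at compile time. The following is only a minimal sketch: the macro and function names are made up for illustration, and a result of 1 means only that a collision is permitted by the standard, not that the platform actually produces one.

```c
#include <limits.h>
#include <stdio.h>

/* BYTE_MAY_COLLIDE_WITH_EOF is a hypothetical macro name, not a standard one.
   It is 0 when every unsigned char value fits in a non-negative int, so no
   byte returned by fgetc can compare equal to the negative EOF. */
#if UCHAR_MAX <= INT_MAX
#define BYTE_MAY_COLLIDE_WITH_EOF 0
#else
/* Loophole case: the conversion to int is implementation-defined here. */
#define BYTE_MAY_COLLIDE_WITH_EOF 1
#endif

/* Function form of the same check (hypothetical helper), handy for logging. */
int byte_may_collide_with_eof(void)
{
    return BYTE_MAY_COLLIDE_WITH_EOF;
}
```

On mainstream platforms with 8-bit bytes and an int of 16 bits or more, this evaluates to 0, and the plain `!= EOF` loop is safe.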

欢心
#3 · 2019-01-20 12:20

Yes, calling c = fgetc(file); and then testing if (feof(file)) does work for maximum portability. It works in general and also when unsigned char and int have the same number of distinct values. This occurs on rare platforms where char, signed char, unsigned char, short, unsigned short, int, and unsigned all have the same bit width and the same size of range.

Note that checking feof(file) alone is insufficient. Code should also check ferror(file).

int c;
for (;;)
{
    c = fgetc(file);
    if (c == EOF) {
      if (feof(file)) break;
      if (ferror(file)) break;
    }
    // do stuff
}
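
As a rough illustration of the loop above in a complete function, here is a sketch that counts the bytes left in a stream and reports whether reading stopped at end-of-file or on an error. The name count_bytes and its return convention are assumptions made for this example, not anything from the standard library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts the bytes remaining in `file`.
   Returns 0 on a clean end-of-file and -1 on a read error. */
int count_bytes(FILE *file, unsigned long *count)
{
    int c;

    assert(file != NULL && count != NULL);
    *count = 0;
    for (;;)
    {
        c = fgetc(file);
        if (c == EOF) {
            if (feof(file)) return 0;    /* genuine end of input */
            if (ferror(file)) return -1; /* read failure */
            /* Neither indicator is set: c is real data that happens
               to equal EOF (only possible on exotic platforms). */
        }
        ++*count;
    }
}
```

The caller can then tell a short count caused by an I/O error apart from a file that simply ended.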
老娘就宠你
#4 · 2019-01-20 12:21

NOTE: chux's answer is the correct one in the most general case. I'm leaving this answer up because I believe both the answer and the discussion in the comments are valuable in understanding the (rare) situations in which chux's approach is necessary.

EOF is guaranteed to have a negative value (C99 7.19.1), and as you mentioned, fgetc reads its input as an unsigned char before converting to int. So those by themselves guarantee that EOF can't be read from a file.

As for your specific questions:

  • fgetc can't read a legitimate datum equal to EOF. In the file, there's no such thing as signed or unsigned; it's just bit sequences. It's C that interprets 1000 1111 differently depending on whether it's being treated as signed or unsigned. fgetc is required to treat it as unsigned, so negative numbers (other than EOF) cannot be returned.

    Addendum: It can't read EOF for the unsigned char part, but when it converts the unsigned char to an int, if the int is not capable of representing all values of the unsigned char, then the behavior is implementation-defined (6.3.1.3).

  • fgetc is required by the standard for hosted implementations, but freestanding implementations are permitted to omit most of the standard library functions (some are apparently required, but I couldn't find the list.)

  • EOF won't require a long, since fgetc needs to be able to return it and fgetc returns an int.

  • As far as altering the data goes, fgetc can't change a value outright, but since it is specified to read "characters" from the file as opposed to chars, it could potentially read in 8 bits at a time even if the system otherwise defines CHAR_BIT to be 16 (which is the minimum value it could have if sizeof(int) == 1, since INT_MIN <= -32767 and INT_MAX >= 32767 are required by 5.2.4.2.1). In that case, the input character would be converted to an unsigned char whose high bits are always 0, and the conversion to int would then lose no information. (In practice, this just won't come up, since machines don't generally have 16-bit bytes.)
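
To make the first point concrete, here is a small sketch that writes a single 0xFF byte to a temporary file and reads it back with fgetc. The helper name read_back_ff is hypothetical, and the sketch assumes an ordinary platform where CHAR_BIT is 8 and tmpfile() succeeds.

```c
#include <stdio.h>

/* Hypothetical helper: stores one 0xFF byte in a temporary file and
   returns what fgetc reports for it, or EOF if any step fails. */
int read_back_ff(void)
{
    FILE *f = tmpfile();
    int c;

    if (f == NULL)
        return EOF;
    if (fputc(0xFF, f) == EOF) {
        fclose(f);
        return EOF;
    }
    rewind(f);       /* go back to the start before reading */
    c = fgetc(f);    /* the byte comes back as an unsigned char value */
    fclose(f);
    return c;
}
```

Under those assumptions the byte comes back as 255 rather than a negative number, which is why it cannot be mistaken for EOF as long as the result is kept in an int.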
