We often use fgetc like this:
    int c;
    while ((c = fgetc(file)) != EOF)
    {
        // do stuff
    }
Theoretically, if a byte in the file has the value of EOF, this code is buggy: it will break out of the loop early and fail to process the whole file. Is this situation possible?
As far as I understand, fgetc internally casts a byte read from the file to unsigned char and then to int, and returns it. This will work if the range of int is greater than that of unsigned char.
What happens if it's not (probably then sizeof(int) == 1)?
- Will fgetc sometimes read legitimate data equal to EOF from a file?
- Will it alter the data it read from the file to avoid the single value EOF?
- Will fgetc be an unimplemented function?
- Will EOF be of another type, like long?
I could make my code fool-proof by an extra check:
    int c;
    for (;;)
    {
        c = fgetc(file);
        if (feof(file))
            break;
        // do stuff
    }
Is it necessary if I want maximum portability?
The C specification says that int must be able to hold values from -32767 to 32767 at a minimum. Any platform with a smaller int is nonstandard.
The C specification also says that EOF is a negative int constant and that fgetc returns "an unsigned char converted to an int" in the event of a successful read. Since unsigned char can't have a negative value, the value of EOF can be distinguished from anything read from the stream.*
*See below for a loophole case in which this fails to hold.
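As a rough illustration of this argument, the sketch below checks at compile time that int can hold every unsigned char value, and then confirms at run time that no unsigned char value converts to EOF. The helper name eof_is_distinct is made up for this example, and static_assert assumes a C11 or later compiler.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* UCHAR_MAX, INT_MAX */
#include <stdio.h>    /* EOF */

/* Fails to compile on the rare platforms where int cannot hold
   every unsigned char value. */
static_assert(UCHAR_MAX <= INT_MAX,
              "int cannot represent every unsigned char value");

/* Hypothetical helper: returns 1 if no unsigned char value converts to EOF. */
static int eof_is_distinct(void)
{
    unsigned char uc = 0;
    do {
        if ((int)uc == EOF)
            return 0;             /* a data byte would be mistaken for EOF */
    } while (uc++ != UCHAR_MAX);  /* stops after testing UCHAR_MAX itself */
    return 1;
}
```

On mainstream platforms with 8-bit char, the static assertion passes and eof_is_distinct() returns 1, so the plain `!= EOF` loop is safe there.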
Relevant standard text (from C99):
§5.2.4.2.1 Sizes of integer types <limits.h>
§7.19.1 <stdio.h> - Introduction
§7.19.7.1 The fgetc function
If UCHAR_MAX ≤ INT_MAX, there is no problem: all unsigned char values will be converted to non-negative integers, so they will be distinct from EOF.
Now, there is a funny sort of loophole here: if a system has UCHAR_MAX > INT_MAX, then that system is legally allowed to convert values greater than INT_MAX to negative integers (per §6.3.1.3, the result of converting a value to a signed type that cannot represent that value is implementation-defined), making it possible for a character read from a stream to be converted to EOF.
Systems with CHAR_BIT > 8 do exist (e.g. the TI C4x DSP, which apparently uses 32-bit bytes), although I'm not sure if they are broken with respect to EOF and stream functions.

Yes, c = fgetc(file); if (feof(file)) does work for maximum portability. It works in general and also when unsigned char and int have the same number of unique values. This occurs on rare platforms where char, signed char, unsigned char, short, unsigned short, int, and unsigned all use the same bit width and the same width of range.
Note that feof(file) alone is insufficient. Code should also check for ferror(file).

NOTE: chux's answer is the correct one in the most general case. I'm leaving this answer up because I believe both the answer and the discussion in the comments are valuable in understanding the (rare) situations in which chux's approach is necessary.
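For reference, here is a minimal sketch of the feof/ferror approach from chux's answer. It assumes the stream's end-of-file and error indicators are clear on entry; the helper name count_bytes and its return convention are invented for this example, with counting standing in for "do stuff".

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: counts the bytes in a stream. A return value equal
 * to EOF is treated as data unless ferror() or feof() confirms otherwise.
 * Returns the byte count, or -1 on a read error.
 */
static long count_bytes(FILE *file)
{
    long count = 0;

    assert(file != NULL);
    for (;;) {
        int c = fgetc(file);
        if (c == EOF) {
            if (ferror(file))
                return -1;   /* read error */
            if (feof(file))
                break;       /* genuine end of file */
            /* otherwise: a data value that happens to compare equal to EOF */
        }
        count++;             /* "do stuff" with c here */
    }
    return count;
}
```

Checking the indicators only when the return value equals EOF keeps the common path as cheap as the usual `!= EOF` loop, while still telling apart end of file, a read error, and a data value that collides with EOF.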
EOF is guaranteed to have a negative value (C99 7.19.1), and as you mentioned, fgetc reads its input as an unsigned char before converting to int. So those by themselves guarantee that EOF can't be read from a file.
As for your specific questions:
fgetc can't read a legitimate datum equal to EOF. In the file, there's no such thing as signed or unsigned; it's just bit sequences. It's C that interprets 1000 1111 differently depending on whether it's being treated as signed or unsigned. fgetc is required to treat it as unsigned, so negative numbers (other than EOF) cannot be returned.
Addendum: It can't read EOF for the unsigned char part, but when it converts the unsigned char to an int, if the int is not capable of representing all values of the unsigned char, then the behavior is implementation-defined (6.3.1.3).
fgetc is required by the standard for hosted implementations, but freestanding implementations are permitted to omit most of the standard library functions (some are apparently required, but I couldn't find the list.)
EOF won't require a long, since fgetc needs to be able to return it and fgetc returns an int.
As far as altering the data goes, it can't change the value exactly, but since fgetc is specified to read "characters" from the file as opposed to chars, it could potentially read in 8 bits at a time even if the system otherwise defines CHAR_BIT to be 16 (which is the minimum value it could have if sizeof(int) == 1, since INT_MIN <= -32767 and INT_MAX >= 32767 are required by 5.2.4.2.1). In that case, the input character would be converted to an unsigned char that just always had its high bits 0. Then it could make the conversion to int without losing precision. (In practice, this just won't come up, since machines don't generally have 16-bit bytes.)
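The implementation-defined conversion mentioned in the addendum above (6.3.1.3) is hard to observe on ordinary hardware, but it can be modelled with fixed-width types. In this sketch, uint8_t stands in for unsigned char and int8_t for an int of the same width. MODEL_EOF, model_convert and collides_with_eof are all hypothetical names, and the value -1 for EOF is an assumption (the standard only requires it to be negative).

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>   /* uint8_t, int8_t, UINT8_MAX, INT8_MAX */

#define MODEL_EOF (-1)  /* assumption: EOF is -1, as on most implementations */

/* The model only makes sense if the narrow "int" cannot hold every byte. */
static_assert(UINT8_MAX > INT8_MAX,
              "model requires a signed type narrower than the byte range");

/*
 * Hypothetical model of the "unsigned char converted to int" step on a
 * platform where int is no wider than unsigned char.
 */
static int8_t model_convert(uint8_t byte)
{
    /* Out-of-range conversion to a signed type is implementation-defined
       (C99 6.3.1.3); mainstream compilers wrap modulo 2^8. */
    return (int8_t)byte;
}

/* Returns 1 if the byte would be indistinguishable from MODEL_EOF. */
static int collides_with_eof(uint8_t byte)
{
    return model_convert(byte) == MODEL_EOF;
}
```

On compilers that wrap, collides_with_eof(0xFF) is 1 while collides_with_eof(0x7F) is 0, which is the collision that a feof/ferror check is meant to resolve on such platforms.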