Note: I completely reworked the question to more properly reflect what I am setting the bounty for. Please excuse any inconsistencies with already-given answers this might have created. I did not want to create a new question, as previous answers to this one might be helpful.
I am working on implementing a C standard library, and am confused about one specific corner of the standard.
The standard defines the number formats accepted by the scanf function family (%d, %i, %u, %o, %x) in terms of the definitions for strtol, strtoul, and strtod.
The standard also says that fscanf() will only put back a maximum of one character into the input stream, and that therefore some sequences accepted by strtol, strtoul and strtod are unacceptable to fscanf (ISO/IEC 9899:1999, footnote 251).
I tried to find some values that would exhibit such differences. It turns out that the hexadecimal prefix "0x", followed by a character that is not a hexadecimal digit, is one such case where the two function families differ.
Funnily enough, it became apparent that no two of the available C libraries seem to agree on the output. (See test program and example output at the end of this question.)
What I would like to hear is what would be considered standard-compliant behaviour in parsing "0xz", ideally citing the relevant parts of the standard to make the point.
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
int main( void )
{
    int i, count, rc;
    unsigned u;
    char * endptr = NULL;
    char culprit[] = "0xz";
    /* File I/O to assert fscanf == sscanf */
    FILE * fh = fopen( "testfile", "w+" );
    if ( fh == NULL )
    {
        perror( "testfile" );
        return EXIT_FAILURE;
    }
    fprintf( fh, "%s", culprit );
    rewind( fh );
    /* fscanf base 16 */
    u = -1; count = -1;
    rc = fscanf( fh, "%x%n", &u, &count );
    /* u is printed as a signed int so that an untouched value shows as -1 */
    printf( "fscanf: Returned %d, result %2d, consumed %d\n", rc, (int)u, count );
    rewind( fh );
    /* strtoul base 16 */
    u = (unsigned)strtoul( culprit, &endptr, 16 );
    /* the pointer difference is a ptrdiff_t; cast it to match %d */
    printf( "strtoul: result %2d, consumed %d\n", (int)u, (int)( endptr - culprit ) );
    puts( "" );
    /* fscanf base 0 */
    i = -1; count = -1;
    rc = fscanf( fh, "%i%n", &i, &count );
    printf( "fscanf: Returned %d, result %2d, consumed %d\n", rc, i, count );
    rewind( fh );
    /* strtol base 0 */
    i = (int)strtol( culprit, &endptr, 0 );
    printf( "strtoul: result %2d, consumed %d\n", i, (int)( endptr - culprit ) );
    fclose( fh );
    return 0;
}
/* newlib 1.14
fscanf: Returned 1, result 0, consumed 1
strtoul: result 0, consumed 0
fscanf: Returned 1, result 0, consumed 1
strtoul: result 0, consumed 0
*/
/* glibc-2.8
fscanf: Returned 1, result 0, consumed 2
strtoul: result 0, consumed 1
fscanf: Returned 1, result 0, consumed 2
strtoul: result 0, consumed 1
*/
/* Microsoft MSVC
fscanf: Returned 0, result -1, consumed -1
strtoul: result 0, consumed 0
fscanf: Returned 0, result 0, consumed -1
strtoul: result 0, consumed 0
*/
/* IBM AIX
fscanf: Returned 0, result -1, consumed -1
strtoul: result 0, consumed 1
fscanf: Returned 0, result 0, consumed -1
strtoul: result 0, consumed 1
*/
To summarize what should happen according to the standard when parsing numbers:
- if fscanf() succeeds, the result must be identical to the one obtained via strto*()
- in contrast to strto*(), fscanf() fails if the input item according to the definition of fscanf() (the longest sequence of input characters that is, or is a prefix of, a matching input sequence) is not the subject sequence according to the definition of strto*() (the longest initial subsequence of the input that is of the expected form)
This is somewhat ugly, but a necessary consequence of the requirement that fscanf() should be greedy, but can't push back more than one character.
Some library implementers opted for differing behaviour. In my opinion
- a strto*() that fails just to make results consistent is stupid (bad mingw)
- an fscanf() that accepts all values accepted by strto*() violates the standard, but is justified (hurray for newlib if they didn't botch strto*() :()
Answer obsolete after rewrite of question. Some interesting links in the comments though.
After testing all combinations of conversion specifiers and input variations I could think of, I can say that it is correct that the two function families do not give identical results. (At least in glibc, which is what I have available for testing.)
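The difference appears when three circumstances meet:
- The conversion specifier is "%i" or "%x" (allowing hexadecimal input).
- The input contains the (optional) "0x" hexadecimal prefix.
- No valid hexadecimal digit follows the prefix.
Example code:
Output: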
This confuses me. Obviously sscanf() does not bail out at the 'x', or it wouldn't be able to parse any "0x"-prefixed hexadecimals. So it has read the 'z' and found it non-matching. But it decides to use only the leading "0" as the value. That would mean pushing the 'z' and the 'x' back. (Yes, I know that sscanf(), which I used here for easy testing, does not operate on a stream, but I strongly assume they made all ...scanf() functions behave identically for consistency.)
So... the one-character ungetc() limit doesn't really seem to be the reason here...? :-/
Yes, results differ. I still cannot explain it properly, though... :-(
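To make the one-character pushback constraint concrete, here is a minimal sketch of how a stream-based %x conversion might proceed if it relies on nothing beyond getc() and a single ungetc(). The name scan_hex_one_pushback is a hypothetical helper, not taken from any of the libraries tested; whitespace skipping, sign handling, field widths and overflow checks are left out on purpose.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a library function): reads a base-16 input item
   from fh using getc() and at most one ungetc().
   Returns 1 and stores the value in *out on success, 0 on a matching failure.
   *consumed receives the number of characters taken from the stream. */
static int scan_hex_one_pushback( FILE * fh, unsigned * out, int * consumed )
{
    static const char digits[] = "0123456789abcdef";
    unsigned value = 0;
    int ndigits = 0;
    int c;

    assert( fh != NULL && out != NULL && consumed != NULL );
    *consumed = 0;

    c = getc( fh );
    if ( c == '0' )
    {
        ++*consumed;
        ndigits = 1;                /* a lone "0" is a complete number so far */
        c = getc( fh );
        if ( c == 'x' || c == 'X' )
        {
            ++*consumed;
            ndigits = 0;            /* "0x" alone is only a prefix of a number */
            c = getc( fh );
        }
    }

    while ( c != EOF && isxdigit( c ) )
    {
        value = value * 16 + (unsigned)( strchr( digits, tolower( c ) ) - digits );
        ++*consumed;
        ++ndigits;
        c = getc( fh );
    }

    if ( c != EOF )
    {
        ungetc( c, fh );            /* the single pushback this sketch allows itself */
    }

    if ( ndigits == 0 )
    {
        return 0;                   /* e.g. "0xz": the prefix was read, no digit followed */
    }

    *out = value;
    return 1;
}
```

Given a stream holding 0xz, this sketch returns 0 with two characters consumed and z left as the next character to read. By the time the z shows up, the x is already gone and only the z can be put back, so falling back to a plain 0 would need a second pushback.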
According to the C99 spec, the scanf() family of functions parses integers the same way as the strto*() family of functions. For example, for the conversion specifier x this reads (C99 7.19.6.2):
"Matches an optionally signed hexadecimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 16 for the base argument."
So if sscanf() and strtoul() give different results, the libc implementation doesn't conform.
What the expected results of your sample code should be is a bit unclear, though: strtoul() accepts an optional prefix of 0x or 0X if base is 16, and the spec reads (C99 7.20.1.4):
"The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form."
For the string "0xz", in my opinion the longest initial subsequence of expected form is "0", so the value should be 0 and the endptr argument should be set to point at the x.
mingw-gcc 4.4.0 disagrees and fails to parse the string with both strtoul() and sscanf(). The reasoning could be that the longest initial subsequence of expected form is "0x", which is not a valid integer literal, so no parsing is done.
I think this interpretation of the standard is wrong: a subsequence of expected form should always yield a valid integer value (if out of range, the MIN/MAX values are returned and errno is set to ERANGE).
cygwin-gcc 3.4.4 (which uses newlib as far as I know) will also not parse the literal if strtoul() is used, but parses the string according to my interpretation of the standard with sscanf().
Beware that my interpretation of the standard is prone to your initial problem, i.e. that the standard only guarantees to be able to ungetc() once. To decide if the 0x is part of the literal, you have to read ahead two characters: the x and the following character. If the latter is not a hex digit, both have to be pushed back. If there are more tokens to parse, you can buffer them and work around this problem, but if it's the last token, you have to ungetc() both characters.
I'm not really sure what fscanf() should do if ungetc() fails. Maybe just set the stream's error indicator?
I am not sure how implementing scanf() may be related to ungetc(). scanf() can use up all bytes in the stream buffer. ungetc() simply pushes a byte to the end of buffer and the offset is also changed.
If the input is "100", the output is "100, 9". I do not see how scanf() and ungetc() may interfere with each other. Sorry if I added a naive comment.
I am not sure I understand the question, but for one thing scanf() is supposed to handle EOF. scanf() and strtol() are different kinds of beasts. Maybe you should compare strtol() and sscanf() instead?
I don't believe the parsing is allowed to produce different results. The Plauger reference is just pointing out that the strtol() implementation might be a different, more efficient version, as it has complete access to the entire string.
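The point about having complete access to the entire string can be illustrated with a small sketch. It follows the reading, argued in one of the answers above, that the subject sequence of "0xz" for base 16 is just "0". The name parse_hex_lookahead is a hypothetical helper, not how any particular libc implements strtoul(); leading whitespace, signs and overflow detection are omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not a library function): parses a base-16 number from
   a string, accepting the 0x/0X prefix only when a hex digit follows it.
   *end is set just past the last character used, or to s if nothing matched. */
static unsigned long parse_hex_lookahead( const char * s, const char ** end )
{
    static const char digits[] = "0123456789abcdef";
    const char * p = s;
    const char * first;
    unsigned long value = 0;

    assert( s != NULL && end != NULL );

    /* The whole string is in memory, so peeking two characters ahead needs
       no pushback at all. */
    if ( p[0] == '0' && ( p[1] == 'x' || p[1] == 'X' )
         && isxdigit( (unsigned char)p[2] ) )
    {
        p += 2;
    }

    first = p;
    while ( isxdigit( (unsigned char)*p ) )
    {
        int c = tolower( (unsigned char)*p );
        value = value * 16 + (unsigned long)( strchr( digits, c ) - digits );
        ++p;
    }

    *end = ( p == first ) ? s : p;
    return value;
}
```

For "0xz" this returns 0 and leaves *end pointing at the x, which corresponds to the "consumed 1" rows reported for strtoul() on glibc and AIX in the question. A stream-based scanner cannot make the same decision without pushing back two characters.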