Difference between scanf() and strtol() / strtod()

Posted 2019-01-14 13:07

Note: I completely reworked the question to more properly reflect what I am setting the bounty for. Please excuse any inconsistencies with already-given answers this might have created. I did not want to create a new question, as previous answers to this one might be helpful.


I am working on implementing a C standard library, and am confused about one specific corner of the standard.

The standard defines the number formats accepted by the scanf function family (%d, %i, %u, %o, %x) in terms of the definitions for strtol, strtoul, and strtod.

The standard also says that fscanf() will only put back a maximum of one character into the input stream, and that therefore some sequences accepted by strtol, strtoul and strtod are unacceptable to fscanf (ISO/IEC 9899:1999, footnote 251).

I tried to find some values that would exhibit such differences. It turns out that the hexadecimal prefix "0x", followed by a character that is not a hexadecimal digit, is one such case where the two function families differ.

Funnily enough, no two of the C libraries I tested agree on the output. (See the test program and example output at the end of this question.)

What I would like to hear is what would be considered standard-compliant behaviour when parsing "0xz", ideally citing the relevant parts of the standard to make the point.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main()
{
    int i, count, rc;
    unsigned u;
    char * endptr = NULL;
    char culprit[] = "0xz";

    /* File I/O to assert fscanf == sscanf */
    FILE * fh = fopen( "testfile", "w+" );
    assert( fh != NULL );
    fprintf( fh, "%s", culprit );
    rewind( fh );

    /* fscanf base 16 */
    u = -1; count = -1;
    rc = fscanf( fh, "%x%n", &u, &count );
    printf( "fscanf:  Returned %d, result %2d, consumed %d\n", rc, (int)u, count );
    rewind( fh );

    /* strtoul base 16 */
    u = strtoul( culprit, &endptr, 16 );
    printf( "strtoul:             result %2d, consumed %d\n", (int)u, (int)( endptr - culprit ) );

    puts( "" );

    /* fscanf base 0 */
    i = -1; count = -1;
    rc = fscanf( fh, "%i%n", &i, &count );
    printf( "fscanf:  Returned %d, result %2d, consumed %d\n", rc, i, count );
    rewind( fh );

    /* strtol base 0 */
    i = strtol( culprit, &endptr, 0 );
    printf( "strtoul:             result %2d, consumed %d\n", i, (int)( endptr - culprit ) );

    fclose( fh );
    return 0;
}

/* newlib 1.14

fscanf:  Returned 1, result  0, consumed 1
strtoul:             result  0, consumed 0

fscanf:  Returned 1, result  0, consumed 1
strtoul:             result  0, consumed 0
*/

/* glibc-2.8

fscanf:  Returned 1, result  0, consumed 2
strtoul:             result  0, consumed 1

fscanf:  Returned 1, result  0, consumed 2
strtoul:             result  0, consumed 1
*/

/* Microsoft MSVC

fscanf:  Returned 0, result -1, consumed -1
strtoul:             result  0, consumed 0

fscanf:  Returned 0, result  0, consumed -1
strtoul:             result  0, consumed 0
*/

/* IBM AIX

fscanf:  Returned 0, result -1, consumed -1
strtoul:             result  0, consumed 1

fscanf:  Returned 0, result  0, consumed -1
strtoul:             result  0, consumed 1
*/

8 answers
啃猪蹄的小仙女
#2 · 2019-01-14 13:55

For input to the scanf() functions as well as to the strtol() functions, Sec. 7.20.1.4 P7 states: "If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer." You must also take into account that these tokens are parsed according to the rules of Sec. 6.4.4 (Constants), which Sec. 7.20.1.4 P5 refers to.

The rest of the behavior, such as the errno value, should be implementation-specific. For example, on my FreeBSD box I got both EINVAL and ERANGE values, and the same happens under Linux, whereas the standard refers only to the ERANGE errno value.

beautiful°
#3 · 2019-01-14 13:58

Communication with Fred J. Tydeman, Vice-chair of PL22.11 (ANSI "C"), on comp.std.c shed some light on this:

fscanf

An input item is defined as the longest sequence of input characters [...] which is, or is a prefix of, a matching input sequence. (7.19.6.2 P9)

This makes "0x" the longest sequence that is a prefix of a matching input sequence. (Even with %i conversion, as the hex "0x" is a longer sequence than the decimal "0".)

The first character, if any, after the input item remains unread. (7.19.6.2 P9)

This makes fscanf read the "z" and put it back as not matching (honoring the one-character pushback limit of footnote 251).

If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. (7.19.6.2 P10)

This makes "0x" fail to match, i.e. fscanf should assign no value, return zero (if the %x or %i was the first conversion specifier), and leave "z" as the first unread character in the input stream.

strtol

The definition of strtol (and strtoul) differs in one crucial point:

The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. (7.20.1.4 P4, emphasis mine)

This means that strtol should look for the longest valid sequence, in this case the "0". It should point endptr to the "x" and return zero as the result.
