fgetpos() behaviour depends on newline character

Posted 2019-02-16 23:57

Consider these two files:

file1.txt (Windows newline)

abc\r\n
def\r\n

file2.txt (Unix newline)

abc\n
def\n

I've noticed that for file2.txt, the position obtained with fgetpos is not incremented correctly. I'm working on Windows.

Let me show you an example. The following code:

#include <cstdio>

void read(FILE *file)
{
    int c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);

    fpos_t pos;
    fgetpos(file, &pos); // save the position
    c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);

    fsetpos(file, &pos); // restore the position - the next fgetc should return
    c = fgetc(file);     // 'b' again, which is not the case for file2.txt
    printf("%c (%d)\n", (char)c, c);
    c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);
}

int main()
{
    FILE *file = fopen("file1.txt", "r");
    printf("file1:\n");
    read(file);
    fclose(file);

    file = fopen("file2.txt", "r");
    printf("\n\nfile2:\n");
    read(file);
    fclose(file);

    return 0;
}

produces this output:

file1:
a (97)
b (98)
b (98)
c (99)


file2:
a (97)
b (98)
  (-1)
  (-1)

file1.txt works as expected, while file2.txt behaves strangely. To find out what's wrong with it, I tried the following code:

void read(FILE *file)
{
    int c;
    fpos_t pos;
    while (1)
    {
        fgetpos(file, &pos);
        printf("pos: %d ", (int)pos);
        c = fgetc(file);
        if (c == EOF) break;
        printf("c: %c (%d)\n", (char)c, c);
    }
}

int main()
{
    FILE *file = fopen("file1.txt", "r");
    printf("file1:\n");
    read(file);
    fclose(file);

    file = fopen("file2.txt", "r");
    printf("\n\nfile2:\n");
    read(file);
    fclose(file);

    return 0;
}

I got this output:

file1:
pos: 0 c: a (97)
pos: 1 c: b (98)
pos: 2 c: c (99)
pos: 3 c:
 (10)
pos: 5 c: d (100)
pos: 6 c: e (101)
pos: 7 c: f (102)
pos: 8 c:
 (10)
pos: 10

file2:
pos: 0 c: a (97) // something is going wrong here...
pos: -1 c: b (98)
pos: 0 c: c (99)
pos: 1 c:
 (10)
pos: 3 c: d (100)
pos: 4 c: e (101)
pos: 5 c: f (102)
pos: 6 c:
 (10)
pos: 8

I know that fpos_t is not meant to be interpreted by the programmer, because its contents depend on the implementation. However, the example above illustrates the problem with fgetpos/fsetpos.

How is it possible that the newline sequence affects the internal position of the file, even before the library encounters those characters?

2 Answers
家丑人穷心不美
Reply #2 · 2019-02-17 00:34

I'm adding this as supporting information for teppic's answer:

When dealing with a FILE* that has been opened as text instead of binary, the fgetpos() function in VC++ 11 (VS 2012) may (and does for your file2.txt example) end up in this stretch of code:

// ...

if (_osfile(fd) & FTEXT) {
        /* (1) If we're not at eof, simply copy _bufsiz
           onto rdcnt to get the # of untranslated
           chars read. (2) If we're at eof, we must
           look through the buffer expanding the '\n'
           chars one at a time. */

        // ...

        if (_lseeki64(fd, 0i64, SEEK_END) == filepos) {

            max = stream->_base + rdcnt;
            for (p = stream->_base; p < max; p++)
                if (*p == '\n')                     // <---
                    /* adjust for '\r' */           // <---
                    rdcnt++;                        // <---

// ...

It assumes that any \n character in the buffer was originally a \r\n sequence that was normalized when the data was read into the buffer. So there are times when it tries to account for the (now missing) \r character that it believes earlier processing of the file removed from the buffer. This particular adjustment happens when you're near the end of the file; however, there are other similar adjustments in the fgetpos() handling to account for removed \r bytes.
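
To make the arithmetic concrete, here is a simplified model of that adjustment. It is a sketch, not the CRT's actual code: the function name modeled_pos and its parameters are hypothetical, and it assumes the whole file is already in the stream buffer (which is the case after the first fgetc on files this small). Under those assumptions it yields the same numbers as the output shown in the question.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not CRT code: models the position a text-mode
// fgetpos() would report once the whole file sits in the stream buffer.
//   raw_size : size of the file on disk, in bytes
//   buffer   : buffer contents as the program sees them (after \r\n -> \n)
//   consumed : number of buffer characters already read by the program
long long modeled_pos(long long raw_size, const std::string &buffer,
                      std::size_t consumed)
{
    long long unread = 0;
    for (std::size_t i = consumed; i < buffer.size(); ++i) {
        ++unread;                 // the character itself
        if (buffer[i] == '\n')
            ++unread;             // assume a '\r' was stripped before every '\n'
    }
    // The OS file pointer is at the end of the file, so the logical position
    // is taken to be the raw size minus the (adjusted) unread byte count.
    return raw_size - unread;
}
```

For file2.txt (8 bytes on disk, no \r to strip), modeled_pos(8, "abc\ndef\n", 1) is -1, because the two \n characters are each counted as two bytes. For file1.txt (10 bytes on disk), the same buffer gives modeled_pos(10, "abc\ndef\n", 1) == 1, which is the correct offset.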

放荡不羁爱自由
Reply #3 · 2019-02-17 00:42

I would say the problem is probably caused by the second file confusing the implementation: it is opened in text mode, but it doesn't follow the requirements for a text file on this platform.

In the standard,

A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character

Your second file stream contains no valid newline sequences (on Windows, the implementation looks for \r\n to convert to the newline character internally). As a result, the implementation may not compute line lengths properly, and it can get hopelessly confused when you try to move about in the stream.

Additionally,

Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment.

Bear in mind that the library will not just read each byte from the file as you call fgetc - it will read the entire file (for one so small) into the stream's buffer and operate on that.
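
If the file cannot be guaranteed to use the platform's newline convention, one common way around this is to open it in binary mode, so that no translation (and no \r bookkeeping) takes place. The following is a minimal sketch under that assumption; the helper name binary_roundtrip_ok and the idea of writing a throwaway test file are illustrative, not part of the original question.

```cpp
#include <cstdio>

// Hypothetical helper: writes "abc\ndef\n" to `path`, reopens the file in
// binary mode, and checks that fsetpos() returns to the position saved by
// fgetpos(). Returns true when 'b' is read both times.
bool binary_roundtrip_ok(const char *path)
{
    std::FILE *out = std::fopen(path, "wb");  // "wb": bytes written untranslated
    if (!out) return false;
    std::fputs("abc\ndef\n", out);
    std::fclose(out);

    std::FILE *in = std::fopen(path, "rb");   // "rb": no newline translation
    if (!in) return false;

    std::fgetc(in);                           // consume 'a'
    std::fpos_t pos;
    if (std::fgetpos(in, &pos) != 0) { std::fclose(in); return false; }
    int first = std::fgetc(in);               // 'b'

    if (std::fsetpos(in, &pos) != 0) { std::fclose(in); return false; }
    int again = std::fgetc(in);               // should be 'b' again

    std::fclose(in);
    return first == 'b' && again == 'b';
}
```

The trade-off is that in binary mode a Windows-style file delivers its \r characters to the program, so the reading code has to skip or handle them itself.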
