Why would redirection work where piping fails?

Posted 2019-01-26 01:08

Question:

In theory, these two command-lines should be equivalent:

#1:

type tmp.txt | test.exe

#2:

test.exe < tmp.txt

I have a process involving #1 that worked fine for many years. At some point within the last year we started compiling the program with a newer version of Visual Studio, and it now fails due to malformed input (see below). #2, however, succeeds (no exception, and we see the expected output). Why would #2 succeed where #1 fails?

I've been able to reduce test.exe to the program below. Our input file has exactly one tab per line and uniformly uses CR/LF line endings. So this program should never write to stderr:

#include <cstring>   // strchr
#include <iostream>
#include <string>

int __cdecl main(int argc, char** argv)
{
    std::istream* pIs = &std::cin;
    std::string line;

    int lines = 0;
    while (!(pIs->eof()))
    {
        if (!std::getline(*pIs, line))
        {
            break;
        }

        const char* pLine = line.c_str();
        int tabs = 0;
        while (pLine)
        {
            pLine = strchr(pLine, '\t');
            if (pLine)
            {
                // move past the tab
                pLine++;
                tabs++;
            }
        }

        if (tabs > 1)
        {
            std::cerr << "We lost a linebreak after " << lines << " good lines.\n";
            lines = -1;
        }

        lines++;
    }

    return 0;
}

When run via #2, there is (correctly) no output. When run via #1, I get the following output, with the same numbers every time; in each case, getline has returned two concatenated lines with no intervening linebreak:

We lost a linebreak after 8977 good lines.
We lost a linebreak after 1468 good lines.
We lost a linebreak after 20985 good lines.
We lost a linebreak after 6982 good lines.
We lost a linebreak after 1150 good lines.
We lost a linebreak after 276 good lines.
We lost a linebreak after 12076 good lines.
We lost a linebreak after 2072 good lines.
We lost a linebreak after 4576 good lines.
We lost a linebreak after 401 good lines.
We lost a linebreak after 6428 good lines.
We lost a linebreak after 7228 good lines.
We lost a linebreak after 931 good lines.
We lost a linebreak after 1240 good lines.
We lost a linebreak after 2432 good lines.
We lost a linebreak after 553 good lines.
We lost a linebreak after 6550 good lines.
We lost a linebreak after 1591 good lines.
We lost a linebreak after 55 good lines.
We lost a linebreak after 2428 good lines.
We lost a linebreak after 1475 good lines.
We lost a linebreak after 3866 good lines.
We lost a linebreak after 3000 good lines.

Answer 1:

This turns out to be a known issue:

The bug is in fact in the lower-level _read function, which the stdio library functions (including both fread and fgets) use to read from a file descriptor.

The bug in _read is as follows: If…

  1. you are reading from a text mode pipe,
  2. you call _read to read N bytes,
  3. _read successfully reads N bytes, and
  4. the last byte read is a carriage return (CR) character,

then the _read function will complete the read successfully but will return N-1 instead of N. The CR or LF character at the end of the result buffer is not counted in the return value.

In the specific issue reported in this bug, fread calls _read to fill the stream buffer. _read reports that it filled N-1 bytes of the buffer and the final CR or LF character is lost.
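
To see why a lost terminator shows up as two joined lines, here is a toy model of the behavior described above. It is only a sketch under assumptions, not the CRT's code: the names fill_and_consume and count_merged_lines are made up for illustration, the input is treated as already translated to LF, and "a full fill that ends on the line terminator" stands in for the CR condition in the list.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>

// Toy model only: 'text' stands for input after CRLF => LF translation and
// 'n' (must be > 0) for the size of each buffer fill. Not the CRT's code.
std::string fill_and_consume(const std::string& text, std::size_t n, bool underReport)
{
    std::string seen;
    for (std::size_t pos = 0; pos < text.size(); pos += n)
    {
        const std::string chunk = text.substr(pos, n);
        std::size_t reported = chunk.size();

        // Modeled defect: a full fill that ends on the line terminator
        // is reported as one byte shorter than it really was.
        if (underReport && chunk.size() == n && chunk.back() == '\n')
        {
            --reported;
        }

        // The caller trusts the reported count, so the last byte is dropped.
        seen.append(chunk, 0, reported);
    }
    return seen;
}

// Counts lines holding more than one tab, like the check in the question.
int count_merged_lines(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    int merged = 0;
    while (std::getline(in, line))
    {
        if (std::count(line.begin(), line.end(), '\t') > 1)
        {
            ++merged;
        }
    }
    return merged;
}
```

With the text "a\tb\nc\td\n" and n = 4, every fill ends on a terminator, so the modeled reader sees a single line with two tabs. With n = 5 no full fill ends on a terminator and nothing is lost. That dependence on alignment would fit the irregular gaps between the errors reported in the question.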

The bug is fundamentally timing-sensitive because whether _read can successfully read N bytes from the pipe depends on how much data has been written to the pipe. Changing the buffer size or changing when the buffer is flushed may reduce the likelihood of the problem, but it won’t necessarily work around the problem in 100% of cases.

There are several possible workarounds:

  1. Use a binary pipe and do the text mode CRLF => LF translation manually on the reader side. This is not particularly difficult to do (scan the buffer for CRLF pairs; replace them with a single LF).
  2. Call ReadFile with _osfhnd(fh) (the OS handle behind the file descriptor, which user code can obtain through the public _get_osfhandle function), bypassing the CRT’s I/O library on the reader side entirely. This also requires manual text mode translation, since the OS won’t do text mode translation for you.
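
The two workarounds above could be sketched roughly as follows. This is a minimal sketch under assumptions, not a drop-in fix: CrlfTranslator and read_stdin_raw_translated are hypothetical names introduced here, and the 4096-byte buffer is arbitrary. The translator keeps a pending CR between chunks so that a CRLF pair split across two reads is still collapsed to a single LF. The Windows-only part (guarded by _WIN32) reads stdin through ReadFile on the handle returned by _get_osfhandle.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h> // ReadFile, HANDLE, DWORD
#include <io.h>      // _get_osfhandle
#include <stdio.h>   // _fileno, stdin
#endif

// Hypothetical helper: converts CRLF pairs to LF across successive chunks
// read from a binary-mode stream. A CR not followed by LF is kept as-is.
class CrlfTranslator
{
public:
    // Appends the translated form of chunk[0..size) to out.
    void feed(const char* chunk, std::size_t size, std::string& out)
    {
        for (std::size_t i = 0; i < size; ++i)
        {
            const char c = chunk[i];
            if (pendingCr_)
            {
                pendingCr_ = false;
                if (c != '\n')
                {
                    out.push_back('\r'); // lone CR: keep it
                }
                // Otherwise this is CRLF: the CR is dropped, the LF is added below.
            }
            if (c == '\r')
            {
                pendingCr_ = true; // decide once the next byte is seen
            }
            else
            {
                out.push_back(c);
            }
        }
    }

    // Call at end of input to emit a CR that was never followed by LF.
    void finish(std::string& out)
    {
        if (pendingCr_)
        {
            out.push_back('\r');
            pendingCr_ = false;
        }
    }

private:
    bool pendingCr_ = false;
};

#ifdef _WIN32
// Workaround 2 (sketch): read all of stdin with ReadFile, bypassing the CRT's
// text mode layer, and translate line endings manually.
inline std::string read_stdin_raw_translated()
{
    HANDLE h = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(stdin)));
    CrlfTranslator translator;
    std::string out;
    char buf[4096];
    DWORD got = 0;
    // ReadFile fails with ERROR_BROKEN_PIPE once the writer closes a pipe and
    // succeeds with got == 0 at the end of a regular file; both end the loop.
    while (ReadFile(h, buf, sizeof buf, &got, nullptr) && got > 0)
    {
        translator.feed(buf, got, out);
    }
    translator.finish(out);
    return out;
}
#endif
```

For workaround 1 while staying inside the CRT, the reader would first switch stdin to binary mode, for example with _setmode(_fileno(stdin), _O_BINARY), and then pass each chunk it reads to feed(). The result can be wrapped in a std::istringstream and handed to the existing getline loop.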

We have fixed this bug for the next update to the Universal CRT. Note that the Universal CRT is an operating system component and is serviced independently from the Visual C++ libraries. The next update to the Universal CRT will probably be around the same timeframe as the Windows 10 Anniversary Update this summer.