How to process CSV lines with a NUL char in some elements

Published 2019-09-16 19:04

Question:

When reading and parsing a CSV-file line, I need to process the NUL character that appears as the value of some row fields. It is complicated by the fact that the CSV file is sometimes in windows-1250 encoding, sometimes in UTF-8, and sometimes in UTF-16. Because of this, I started down one path and only discovered the NUL-char problem later -- see below.

Details: I need to clean CSV files from a third party into the form common to our data extractor (that is, the utility works as a filter -- transforming one CSV form into another CSV form).

My initial approach was to open the CSV file in binary mode and check whether the first bytes form a BOM. I know that all the Unicode files we are given start with a BOM; if there is no BOM, the file is in windows-1250 encoding. The converted CSV file should use the windows-1250 encoding. So, after checking the input file, I reopen it using the corresponding mode, like this:

// Open the file in binary mode first to see whether BOM is there or not.
FILE * fh{ nullptr };
errno_t err = fopen_s(&fh, fnameIn.string().c_str(), "rb"); // const fs::path & fnameIn
assert(err == 0);
vector<unsigned char> buf(4, '\0'); // unsigned: with a plain (signed) char,
                                    // the comparisons with 0xEF etc. below
                                    // would never be true
fread(&buf[0], 1, 3, fh);
::fclose(fh);

// Set the isUnicode flag and open the file according to that.
string mode{ "r" };     // init 
bool isUnicode = false; // pessimistic init

if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) // UTF-8 BOM
{
    mode += ", ccs=UTF-8";
    isUnicode = true;
}
else if ((buf[0] == 0xFE && buf[1] == 0xFF)     // UTF-16 BE BOM
      || (buf[0] == 0xFF && buf[1] == 0xFE))    // UTF-16 LE BOM
{
    mode += ", ccs=UNICODE";
    isUnicode = true;
}

// Open in the suitable mode.
err = fopen_s(&fh, fnameIn.string().c_str(), mode.c_str());
assert(err == 0);

After a successful open, each input line is read via either fgets or fgetws, depending on whether Unicode was detected. The idea was then to convert the buffer content from Unicode to cp1250 if Unicode was detected earlier, or otherwise to leave the buffer as it is. The s variable should contain the string in the windows-1250 encoding; ATL::CW2A(buf, 1250) is used when conversion is needed:

    const int bufsize = 4096;
    wchar_t buf[bufsize];

    // Read the line from the input according to the isUnicode flag.
    while (isUnicode ? (fgetws(buf, bufsize, fh) != NULL)
        : (fgets(reinterpret_cast<char*>(buf), bufsize, fh) != NULL))
    {
        // If the input is in Unicode, convert the buffer content
        // to the string in cp1250. Otherwise, do not touch it.
        string s;
        if (isUnicode)  s = ATL::CW2A(buf, 1250);
        else            s = reinterpret_cast<char*>(buf);
        ...
        // Process the characters of `s` to form the output file.
    }

It worked fine... until a file appeared in which a NUL character was used as a field value. The problem is that when the s variable is assigned, the NUL cuts off the rest of the line, because both the std::string assignment from a char* and ATL::CW2A treat the buffer as NUL-terminated. In the observed case it happened with a file in the 1250 encoding, but it can probably happen with the UTF-encoded files as well.
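
A minimal sketch of the truncation, with hypothetical byte values just for illustration (only the explicit-length std::string constructor preserves the embedded NUL):

    #include <string>

    // The raw line bytes contain an embedded NUL as a field value.
    char raw[] = { 'a', ';', '\0', ';', 'b', '\n', '\0' };

    std::string cut = raw;                   // strlen stops at the embedded NUL
    std::string full(raw, sizeof(raw) - 1);  // explicit length keeps everything
    // cut.size() == 2 ("a;"), full.size() == 6 ("a;\0;b\n")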

How to solve the problem?

Answer 1:

The NUL character problem can be solved by using either plain C++ or Windows API functions. In this case, the easiest solution is MultiByteToWideChar, which accepts an explicit string length, precisely so that it does not stop at a NUL.
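
A sketch of how that can look, assuming the raw line bytes and their count are already available (the helper names are mine, not from the question; the caller must track the byte count itself, e.g. from the fread return value, since strlen would stop at the NUL):

    #include <windows.h>
    #include <string>

    // Convert a cp1250 buffer of known byte length to a std::wstring,
    // preserving embedded NUL characters.
    std::wstring cp1250ToWide(const char* bytes, int lenBytes)
    {
        if (lenBytes <= 0) return std::wstring{};
        // The first call asks for the required size in wide characters;
        // lenBytes is explicit, so an embedded NUL is just another character.
        int wlen = MultiByteToWideChar(1250, 0, bytes, lenBytes, nullptr, 0);
        std::wstring out(static_cast<size_t>(wlen), L'\0');
        MultiByteToWideChar(1250, 0, bytes, lenBytes, &out[0], wlen);
        return out;
    }

    // The reverse direction (replacing ATL::CW2A) works the same way with
    // WideCharToMultiByte and an explicit character count:
    std::string wideToCp1250(const wchar_t* wbuf, int lenChars)
    {
        if (lenChars <= 0) return std::string{};
        int len = WideCharToMultiByte(1250, 0, wbuf, lenChars,
                                      nullptr, 0, nullptr, nullptr);
        std::string out(static_cast<size_t>(len), '\0');
        WideCharToMultiByte(1250, 0, wbuf, lenChars,
                            &out[0], len, nullptr, nullptr);
        return out;
    }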