File.ReadAllText with UTF-7 ignoring + characters

Posted 2019-09-15 20:31

Question:

I have a file on disk, written by our program, that contains JSON-encoded data.

I am using C#'s File.ReadAllText(string path, Encoding encoding) to read it later. For unrelated reasons, we have to work with UTF-7.

Our line then looks like this:

var content = File.ReadAllText(fileName, Encoding.UTF7);

Writing and then reading works fine for basically everything we need. The only exception is the plus sign (+): if the file contains a + sign, this code returns the entire string with every + removed. So

{ "commandValue": "testvalue + otherValue" }

turns into

{ "commandValue": "testvalue  otherValue" }

I have checked the file bytes, and the + sign is indeed char 0x2B, which is the right character in UTF-7 (and also the same char in UTF-8, not sure if it matters).

I can't figure out why they disappear when reading it.

As a test, I tried reading it with

var content = File.ReadAllText(fileName, Encoding.UTF8);

and it worked fine. The chars did not disappear.

What could I possibly be doing wrong, and how could I make File.ReadAllText(fileName, Encoding.UTF7) not ignore those characters?

As of now, I haven't found another char that has this problem, but I obviously did not test all of them.

Answer 1:

The file is not being written as UTF-7. In UTF-7, '+' is a special character that marks the start of a "modified base64" sequence, so a literal plus sign has to be written as "+-". When the file is read as UTF-7, the decoder sees the bare '+', expects a modified base64 sequence, finds none (the next character is a space), and simply resumes decoding the rest of the file as plain text. The '+' itself is consumed as a shift marker and never reaches the output.
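The behaviour described above can be reproduced without touching the disk. The following is a minimal sketch, not the asker's code: the class name Utf7PlusDemo and the helper DecodeUtf7 are made up for illustration, and it assumes that decoding a byte array with Encoding.UTF7 behaves the same way as File.ReadAllText does on a file without a byte order mark. On .NET 5 and later, Encoding.UTF7 is marked obsolete, so expect a compiler warning.

```csharp
using System;
using System.Text;

static class Utf7PlusDemo
{
    // Hypothetical helper: decodes raw bytes as UTF-7.
    public static string DecodeUtf7(byte[] bytes)
    {
        return Encoding.UTF7.GetString(bytes);
    }

    public static void Main()
    {
        const string text = "testvalue + otherValue";

        // A bare '+' (0x2B), as an ASCII or UTF-8 writer would emit it.
        byte[] bare = Encoding.ASCII.GetBytes(text);

        // The same text with '+' escaped as "+-", the UTF-7 spelling of a literal plus.
        byte[] escaped = Encoding.ASCII.GetBytes(text.Replace("+", "+-"));

        Console.WriteLine(DecodeUtf7(bare));    // the '+' is dropped
        Console.WriteLine(DecodeUtf7(escaped)); // the '+' survives
    }
}
```

If the first line comes out without the plus sign and the second with it, the file on disk was most likely written with a different encoding than the one used to read it.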

To fix the issue, either read the file as UTF-8 (which your own test shows already works), or update the code that writes the file so that it actually encodes the content as UTF-7.
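As a rough sketch of the second option, the writer and the reader can be given the same encoding. The class name Utf7RoundTrip, the RoundTrip helper, and the temporary file are illustrative assumptions, not part of the original program; the only point is that File.WriteAllText and File.ReadAllText receive the same Encoding argument.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf7RoundTrip
{
    // Hypothetical helper: writes the text as UTF-7, then reads it back as UTF-7.
    public static string RoundTrip(string path, string content)
    {
        File.WriteAllText(path, content, Encoding.UTF7);
        return File.ReadAllText(path, Encoding.UTF7);
    }

    public static void Main()
    {
        // Scratch file used only for this demonstration.
        string path = Path.GetTempFileName();
        string json = "{ \"commandValue\": \"testvalue + otherValue\" }";

        try
        {
            string readBack = RoundTrip(path, json);
            // Compare what was read with what was written.
            Console.WriteLine(readBack == json);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If UTF-7 is not a hard requirement on the reading side, passing Encoding.UTF8 to both calls is the simpler choice, since the question already shows that the existing file reads correctly as UTF-8.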