Different behavior of XmlPullParser.getInputEncodi

Posted 2020-03-31 02:19

Question:

I am developing a new feature for my Android app to enable data backup and restore. I am using XML files to back up the data. This is the piece of code that sets the encoding for an output file:

XmlSerializer serializer = Xml.newSerializer();
FileWriter fileWriter = new FileWriter(file, false);
serializer.setOutput(fileWriter);
serializer.startDocument("UTF-8", true);
[... Write data to the file....]

This is how I try to import data from an XML file. First, I check if the encoding is correct:

XmlPullParser parser = Xml.newPullParser();
FileReader reader = new FileReader(file);
parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, false);
parser.setInput(reader);
if(!"UTF-8".equals(parser.getInputEncoding())) {
    throw new IOException("Incorrect file encoding");
}
[... Read data from the file....]

This is where I run into a problem. This code works fine on Android 2.3.3 (both a device and an emulator): the encoding is correctly detected as "UTF-8". But on API11+ versions (Honeycomb, ICS, JB) the exception is thrown. When I run this in debug mode, I can see that parser.getInputEncoding() returns null. I checked the actual XML files produced on 2.3.3 and on later versions, and they have exactly the same header: <?xml version='1.0' encoding='UTF-8' standalone='yes' ?>. Why does getInputEncoding() return null on API11+?

Additional findings:

I have discovered that there is a way to correctly detect file encoding on API11+ devices using FileInputStream instead of FileReader like this:

XmlPullParser parser = Xml.newPullParser();
FileInputStream stream = new FileInputStream(file);
parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, false);
parser.setInput(stream, null);
if(!"UTF-8".equals(parser.getInputEncoding())) {
    throw new IOException("Incorrect file encoding");
}
[... Read data from the file....]

In this case getInputEncoding() properly detects the UTF-8 encoding on API11+ emulators and devices, but it returns null on 2.3.3. So for now I can insert a fork in the code to use FileInputStream on API11+ and FileReader on pre-API11:

if(Build.VERSION.SDK_INT >= Build.VERSION_CODES.HONEYCOMB) {
    parser.setInput(stream, null);
} else {
    parser.setInput(reader);
}

But what's the proper way to check the encoding with XmlPullParser.getInputEncoding()? Why do different versions of Android behave differently depending on whether I use FileInputStream or FileReader?

Answer 1:

After some more trial and error, I've finally managed to figure out what's going on. So despite the fact that the documentation says:

Historically Android has had two implementations of this interface: KXmlParser via XmlPullParserFactory.newPullParser(). ExpatPullParser, via Xml.newPullParser().

Either choice is fine. The example in this section uses ExpatPullParser, via Xml.newPullParser().

The reality is that on older APIs, such as 2.3.3, Xml.newPullParser() returns an ExpatPullParser object, while on Ice Cream Sandwich and up it returns a KXmlParser object. And as we can see from this blog post, the Android developers have known about this since December 2011:

In Ice Cream Sandwich we changed Xml.newPullParser() to return a KxmlParser and deleted our ExpatPullParser class.

...but never bothered to update the official documentation.

So how do you obtain a KXmlParser object on APIs before Ice Cream Sandwich? Simple:

XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
XmlPullParser parser = factory.newPullParser();

...in fact this works on all versions of Android, new and old. Then you supply a FileInputStream to your parser's setInput() method, leaving the encoding argument null so that the parser detects it:

FileInputStream stream = new FileInputStream(file);
parser.setInput(stream, null);

After this, on APIs 11 and higher you can call parser.getInputEncoding() right away and it will return the correct encoding. But on pre-API11 versions, it will return null unless you call parser.next() first, as @Esailija correctly pointed out in his answer. Interestingly enough, on API11+ calling next() doesn't have any negative effect whatsoever, so you may safely use this code on all versions:

parser.next();
String encoding = parser.getInputEncoding();

And this will correctly return "UTF-8".
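
One caveat about the comparison itself: XML encoding names are meant to be matched case-insensitively, so a file that declares encoding='utf-8' would fail a plain "UTF-8".equals(...) check. Below is a minimal sketch of a more tolerant check using only java.nio.charset; the class name EncodingCheck and the helper isUtf8 are hypothetical and not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
class EncodingCheck {
    /**
     * Returns true if the given encoding name (for example the value returned
     * by parser.getInputEncoding()) denotes UTF-8. Null, malformed, and
     * unsupported names are treated as "not UTF-8".
     */
    static boolean isUtf8(String encodingName) {
        if (encodingName == null) {
            return false; // the parser could not report an encoding
        }
        try {
            // Charset.forName resolves names case-insensitively and knows aliases.
            return StandardCharsets.UTF_8.equals(Charset.forName(encodingName));
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUtf8("UTF-8"));      // true
        System.out.println(isUtf8("utf-8"));      // true
        System.out.println(isUtf8("ISO-8859-1")); // false
        System.out.println(isUtf8(null));         // false
    }
}
```

With such a helper, the check would read if (!EncodingCheck.isUtf8(parser.getInputEncoding())) instead of comparing strings directly.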



Answer 2:

FileReader and other readers don't detect encoding. They just use the platform default encoding which can be UTF-8 by coincidence. It has no relation to the actual encoding of the file.
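
To illustrate, here is a small sketch in plain Java (no Android classes) showing that a Reader decodes bytes with whatever charset it was constructed with and never looks at the XML declaration. The class name ReaderCharsetDemo and the method decode are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class ReaderCharsetDemo {
    /** Decodes bytes the way a Reader does: with a charset fixed up front. */
    static String decode(byte[] bytes, Charset charset) {
        try (InputStreamReader reader =
                     new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is a non-ASCII letter that takes two bytes in UTF-8.
        String xml = "<?xml version='1.0' encoding='UTF-8'?><name>caf\u00e9</name>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        // Right charset: the text round-trips.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(xml));      // true
        // Wrong charset: the declaration still says UTF-8, but the Reader
        // ignores it and produces different text.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(xml)); // false
    }
}
```

By the time the parser sees characters from a Reader, the byte-level decision has already been made, which is why the parser has nothing to detect in that case.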

You cannot detect XML file encoding until you read it enough to see the encoding attribute.
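
As a rough illustration of what "reading enough" means, the sketch below pulls the declared encoding out of the first bytes of a document by hand. It is not how XmlPullParser works internally; it assumes an ASCII-compatible encoding with no byte order mark (so it would not handle UTF-16), and the class name XmlDeclarationSniffer and the method declaredEncoding are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class XmlDeclarationSniffer {
    // Matches an XML declaration at the very start of the input and captures
    // the value of its encoding pseudo-attribute.
    private static final Pattern DECLARED_ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the encoding named in the XML declaration, or null if the
     * first bytes contain no declaration with an encoding.
     */
    static String declaredEncoding(byte[] head) {
        // Only the beginning of the document is needed. ISO-8859-1 maps every
        // byte to one char, so decoding cannot fail.
        int length = Math.min(head.length, 200);
        String prefix = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED_ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] withDecl = "<?xml version='1.0' encoding='UTF-8' standalone='yes' ?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] withoutDecl = "<root/>".getBytes(StandardCharsets.US_ASCII);

        System.out.println(declaredEncoding(withDecl));    // UTF-8
        System.out.println(declaredEncoding(withoutDecl)); // null
    }
}
```

Until those first bytes have been consumed, no parser can know what the declaration says, which is consistent with getInputEncoding() only becoming reliable after the first next() call on some versions.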

From the getInputEncoding() documentation:

if inputEncoding is null and the parser supports the encoding detection feature, it must return the detected encoding

And:

If setInput(Reader) was called, null is returned.

So it appears that pre-API11 versions don't support the detection that is enabled by calling setInput(is, null). I don't know how you are getting "UTF-8" when using setInput(reader), as the documentation says it should return null.

Then:

After first call to next if XML declaration was present this method will return encoding declared.

So on pre-API11 versions, you could try calling .next() once before calling .getInputEncoding().