I am developing a new feature for my Android app to enable data backup and restore. I am using XML files to back up data. This is the piece of code that sets the encoding for an output file:
XmlSerializer serializer = Xml.newSerializer();
FileWriter fileWriter = new FileWriter(file, false);
serializer.setOutput(fileWriter);
serializer.startDocument("UTF-8", true);
[... Write data to the file....]
This is how I try to import data from an XML file. First, I check if the encoding is correct:
XmlPullParser parser = Xml.newPullParser();
FileReader reader = new FileReader(file);
parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, false);
parser.setInput(reader);
if (!"UTF-8".equals(parser.getInputEncoding())) {
    throw new IOException("Incorrect file encoding");
}
[... Read data from the file....]
And here I'm running into a problem. This code works fine on Android 2.3.3 (both a device and an emulator): the encoding is correctly detected as "UTF-8". But on API 11+ versions (Honeycomb, ICS, JB) the exception is thrown. When I run this in debug mode I can see that parser.getInputEncoding() returns null. I checked the actual XML files produced on 2.3.3 and on later versions, and they have exactly the same header: <?xml version='1.0' encoding='UTF-8' standalone='yes' ?>. Why does getInputEncoding() return null on API 11+?
Additional findings:
I have discovered that there is a way to correctly detect the file encoding on API 11+ devices using FileInputStream instead of FileReader, like this:
XmlPullParser parser = Xml.newPullParser();
FileInputStream stream = new FileInputStream(file);
parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, false);
parser.setInput(stream, null);
if (!"UTF-8".equals(parser.getInputEncoding())) {
    throw new IOException("Incorrect file encoding");
}
[... Read data from the file....]
In this case getInputEncoding() properly detects UTF-8 encoding on API 11+ emulators and devices, but it returns null on 2.3.3. So for now I can insert a fork in the code to use FileInputStream on API 11+ and FileReader on pre-API 11:
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.HONEYCOMB) {
    parser.setInput(stream, null);
} else {
    parser.setInput(reader);
}
But what's the proper way to check encoding with XmlPullParser.getInputEncoding()? Why do different versions of Android behave differently depending on which one I use, FileInputStream or FileReader?
FileReader and other readers don't detect encoding. They just use the platform default encoding, which can be UTF-8 by coincidence. It has no relation to the actual encoding of the file.
You cannot detect an XML file's encoding until you have read enough of it to see the encoding attribute.
From the getInputEncoding() documentation:
And:
So it appears that pre-11 doesn't support the detection that is enabled by using setInput(is, null). I don't know how you are getting "UTF-8" when using setInput(reader), as the documentation says it should return null.
Then:
So in pre-11, you could try calling .next() initially, before calling .getInputEncoding().
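To illustrate why the declared encoding is only known after the start of the file has been read, here is a minimal plain-Java sketch that looks for the encoding attribute in the first bytes of a document. The class and method names (XmlEncodingSniffer, declaredEncoding) are hypothetical and not part of any Android or XmlPull API, and the sketch assumes an ASCII-compatible encoding, so it would not recognise a UTF-16 file. It is not how XmlPullParser detects encodings internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads the encoding named in an XML declaration. */
public class XmlEncodingSniffer {
    // Matches encoding='...' or encoding="..." inside a leading <?xml ... ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*(['\"])([A-Za-z][A-Za-z0-9._-]*)\\1");

    /** Returns the declared encoding, or null if the bytes carry no declaration. */
    public static String declaredEncoding(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        // Skip a UTF-8 byte order mark if one is present.
        if (text.startsWith("\u00EF\u00BB\u00BF")) {
            text = text.substring(3);
        }
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    /** Reads at most the first 256 bytes of the stream and inspects them. */
    public static String declaredEncoding(InputStream in) throws IOException {
        byte[] buf = new byte[256];
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break; // end of stream
            }
            n += r;
        }
        byte[] head = new byte[n];
        System.arraycopy(buf, 0, head, 0, n);
        return declaredEncoding(head);
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version='1.0' encoding='UTF-8' standalone='yes' ?><backup/>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(declaredEncoding(in)); // prints UTF-8
    }
}
```

A Reader cannot offer this information, because by the time characters reach the parser the bytes have already been decoded with whatever charset the Reader was given.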
After some more trial and error, I've finally managed to figure out what's going on. So despite the fact that the documentation says:
The reality is that on older APIs, such as 2.3.3, Xml.newPullParser() returns an ExpatPullParser object, while on Ice Cream Sandwich and up it returns a KXmlParser object. And as we can see from this blog post, Android developers have known about this since December 2011:
...but never bothered to update the official documentation.
So how do you retrieve a KXmlParser object on APIs before Ice Cream Sandwich? Simple:
...in fact this works on all versions of Android, new and old. Then you supply a FileInputStream to your parser's setInput() method, leaving the default encoding null:
After this, on APIs 11 and higher you can call parser.getInputEncoding() right away and it will return the correct encoding. But on pre-API 11 versions it will return null unless you call parser.next() first, as @Esailija correctly pointed out in his answer. Interestingly enough, on API 11+ calling next() doesn't have any negative effect whatsoever, so you may safely use this code on all versions:
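A minimal sketch of those three steps, assuming the parser is obtained through XmlPullParserFactory (that choice is an assumption here, as is the helper name readDeclaredEncoding; the remaining calls are the ones named in the text). It avoids try-with-resources so that it also compiles for old API levels.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

public final class BackupEncodingCheck {

    /** Hypothetical helper: returns the encoding the parser reports for the file. */
    public static String readDeclaredEncoding(File file)
            throws IOException, XmlPullParserException {
        InputStream stream = new FileInputStream(file);
        try {
            // Assumption: create the parser via the factory instead of Xml.newPullParser().
            XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
            XmlPullParser parser = factory.newPullParser();
            parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, false);

            // Pass the raw bytes and a null encoding so the parser can detect it.
            parser.setInput(stream, null);

            // Needed on pre-API 11 per the text above; harmless on API 11+.
            parser.next();
            return parser.getInputEncoding();
        } finally {
            stream.close();
        }
    }
}
```

Note that next() advances the parser past the first event, so code that goes on to read the data should keep using the same parser and start from its current event rather than creating a new one.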
And this will correctly return "UTF-8".