I'm trying to parse an XML file using DocumentBuilderFactory as follows:
DocumentBuilderFactory ndsParserFactory = DocumentBuilderFactory.newInstance( );
ndsParserFactory.setNamespaceAware( true );
DocumentBuilder ndsParser = ndsParserFactory.newDocumentBuilder( );
Document ndsDocument = ndsParser.parse( ndsFileInputStream );
where ndsFileInputStream is an InputStream wrapping the file containing the XML.
I get an exception when the file contains a Unicode character such as Δ. When I strip out the line containing the offending character, the parsing works just fine.
The file contains the characteristic <?xml version="1.0" encoding="utf-8"?> header.
I'm wondering if I'm neglecting to configure the DocumentBuilderFactory (or DocumentBuilder) instance properly in order to handle the Δ character.
Edit (from comments):
Full disclosure: This is Android, and I'm including XML files (with an NDS file extension) as assets in my Android app. I access them via the AssetManager, which has a handy-dandy method for opening an asset file into an InputStream, which I then pass to the parse method of my DocumentBuilder. – d weld 16 hours ago
I noticed that the assets folder uses CP1252 as the default encoding for its contents, so I changed that to UTF-8. No luck. Then I removed the BOM from one of the NDS files (per link) and tried again. No luck. I'm thinking that the APK file (which is compressed like a ZIP file) is somehow mangling the non-ASCII XML. I think I'll have to resort to getting the NDS files onto the Android device by other means...
Are you sure the file is really written as UTF-8? Obviously you can open it in some editor and it shows the text correctly, but the editor could just be making a good guess at the encoding.
The other thing to remember is that every character in a UTF-8 file is a Unicode character, so "a Unicode character" is not the problem as such. The parser is most likely choking because it hit a byte sequence that isn't valid in the declared encoding. UTF-8 is forgiving in one respect: any character in the 7-bit ASCII set is encoded exactly as it is in plain ASCII, and a lot of XML is made up of nothing but ASCII characters. This catches people out, because as soon as something non-ASCII comes up, defects in the text-encoding path through a system suddenly become apparent.
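One way to test that theory is to run the raw bytes through a strict UTF-8 decoder before handing them to the parser. The following is only a sketch: the class and method names (Utf8Check, isValidUtf8) are made up for illustration, and it assumes java.nio.charset.StandardCharsets is available (Java 7+, or a reasonably recent Android API level).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question's code.
class Utf8Check {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+0394 (Greek capital delta) encoded as UTF-8 is the two bytes 0xCE 0x94.
        byte[] deltaAsUtf8 = "\u0394".getBytes(StandardCharsets.UTF_8);
        // The same character in ISO-8859-7 is the single byte 0xC4.
        byte[] deltaAsGreek = { (byte) 0xC4 };

        System.out.println("UTF-8 bytes valid:      " + isValidUtf8(deltaAsUtf8));
        System.out.println("ISO-8859-7 bytes valid: " + isValidUtf8(deltaAsGreek));
    }
}
```

If the bytes read from the asset fail this check, the file is not really UTF-8 by the time it reaches the parser, whatever the XML declaration says.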
You could try editing the XML declaration to see whether the file parses OK under another character encoding. ISO-8859-7 contains the Δ symbol; could the file be encoded in that?
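If editing the file is inconvenient, a similar experiment can be run from code by decoding the stream yourself and handing the parser a character stream. This is a sketch under assumptions: the names ExplicitEncodingParse and parseAs are invented here, and it relies on the parser using the supplied Reader rather than the encoding named in the XML declaration, which is how SAX InputSource is documented to behave but is worth verifying on Android's parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question's code.
class ExplicitEncodingParse {

    // Parses the stream after decoding it with the given charset name,
    // instead of letting the parser pick the encoding from the declaration.
    static Document parseAs(InputStream in, String charsetName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        InputSource source = new InputSource(new InputStreamReader(in, charsetName));
        return builder.parse(source);
    }

    public static void main(String[] args) throws Exception {
        // Simulate a file that claims utf-8 but was actually saved as ISO-8859-7.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><root>\u0394</root>";
        byte[] greekBytes = xml.getBytes(Charset.forName("ISO-8859-7"));

        Document doc = parseAs(new ByteArrayInputStream(greekBytes), "ISO-8859-7");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

If the file parses under one charset name but not under "UTF-8", that tells you what the bytes really are, and the fix is to re-save the file or correct its declaration rather than to keep the override.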
Also, what is the exception?
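To capture that information, something along these lines may help. It is a sketch only: ParseDiagnostics and describeFailure are hypothetical names, and the exact exception type thrown for a bad byte sequence differs between parser implementations (some report it as an IOException subclass rather than a SAXParseException), which is why both are caught.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of the original question's code.
class ParseDiagnostics {

    // Returns null if the stream parses cleanly, otherwise a description
    // of the failure including its type and, where available, its position.
    static String describeFailure(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(in);
            return null;
        } catch (SAXParseException e) {
            // Well-formedness errors carry a line and column number.
            return e.getClass().getName()
                    + " at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Encoding problems often surface here, depending on the parser.
            return e.getClass().getName() + ": " + e.getMessage();
        }
    }
}
```

Posting the resulting string (exception class, message, and position) makes it much easier to tell an encoding problem from a well-formedness problem.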