I'm trying to parse xml files from different sources (over which I have little control). Most of the them are encoded in UTF-8 and don't cause any problems using the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8 this is fine. However some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared SAX still defaults to UTF-8. Only if I add:
is.setEncoding("ISO-8859-1");
Will SAX use the correct encoding.
How can I let SAX automatically detect the correct encoding from the xml declaration without me specifically setting it? I need this because I don't know before hand what the encoding of the file will be.
Thanks in advance, Allan
I found the answer myself.
The SAX parser uses InputSource internally and from the InputSource docs:
So basically you need to pass a character stream to the parser for it to pick-up the correct encoding. See solution below:
Use InputStream as argument to InputSource when you want Sax to autodetect the encoding.
If you want to set a specific encoding, use Reader with a specified encoding or setEncoding method.
Why? Because autodetection encoding algorithms require raw data, not converted to characters.
The question in the subject is: How to let the SAX parser determine the encoding from the xml declaration? I found Allan's answer to the question misleading and I provided the alternative one, based on Jörn Horstmann's comment and my later experience.