I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems with the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8, this is fine. However, some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared, SAX still defaults to UTF-8.
SAX only uses the correct encoding if I add:
is.setEncoding("ISO-8859-1");
How can I let SAX automatically detect the correct encoding from the XML declaration without setting it explicitly? I need this because I don't know beforehand what the encoding of the file will be.
Thanks in advance,
Allan
Use an InputStream as the argument to InputSource when you want SAX to autodetect the encoding.
If you want to force a specific encoding, use a Reader created with that encoding, or call the setEncoding method.
Why? Because encoding autodetection algorithms need the raw bytes, not data that has already been converted to characters.
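To illustrate the byte-stream case, here is a minimal sketch. The class name EncodingDetectionDemo, the TextCollector handler, and the sample document are made up for this example; only the SAXParserFactory, SAXParser, InputSource and DefaultHandler calls come from the standard API. It assumes the JDK's built-in parser, which reads the encoding from the XML declaration when it is handed raw bytes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EncodingDetectionDemo {

    // Hypothetical handler: collects all character data seen by the parser.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses raw bytes through a byte stream, with no Reader and no setEncoding,
    // so the parser has to work out the encoding itself.
    static String parseFromBytes(byte[] xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TextCollector handler = new TextCollector();
        InputSource source = new InputSource(new ByteArrayInputStream(xml));
        parser.parse(source, handler);
        return handler.text.toString();
    }

    // Sample document (made up): declared and encoded as ISO-8859-1.
    static byte[] sampleLatin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        String text = parseFromBytes(sampleLatin1Document());
        // Compare against the expected text rather than printing it,
        // so the console encoding does not get in the way.
        System.out.println("decoded correctly: " + text.equals("caf\u00e9"));
    }
}
```

The point of the sketch is that nothing in parseFromBytes mentions an encoding: the declaration inside the document is the only hint the parser gets.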
The question in the subject is: How to let the SAX parser determine the encoding from the xml declaration? I found Allan's answer to that question misleading, so I have provided this alternative, based on Jörn Horstmann's comment and my later experience.
I found the answer myself.
The SAX parser uses InputSource internally. From the InputSource docs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See the solution below:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
// Note: an InputStreamReader created without an explicit charset
// decodes the bytes with the JVM's default charset.
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);
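For comparison, here is a minimal sketch of what the InputSource rule on character streams means in practice: once a Reader is supplied, the charset given to InputStreamReader decides how the bytes are decoded, and the declaration in the document is disregarded. The class name ExplicitCharsetDemo, the TextCollector handler, and the deliberately mislabeled sample document are made up for this example, and it assumes the JDK's built-in parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ExplicitCharsetDemo {

    // Hypothetical handler: collects all character data seen by the parser.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses through a character stream; the caller chooses the charset.
    static String parseWithCharset(byte[] xml, Charset charset) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TextCollector handler = new TextCollector();
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), charset);
        InputSource source = new InputSource(reader);
        parser.parse(source, handler);
        return handler.text.toString();
    }

    // Sample document (made up): the declaration claims UTF-8,
    // but the bytes are really ISO-8859-1.
    static byte[] mislabeledDocument() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        // The Reader's charset is used; the declared UTF-8 is not consulted.
        String text = parseWithCharset(mislabeledDocument(), StandardCharsets.ISO_8859_1);
        System.out.println("decoded correctly: " + text.equals("caf\u00e9"));
    }
}
```

So a Reader is the right tool when you already know the real encoding (for example, to override a wrong declaration), but it does not make the parser read the declaration for you.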