I'm capturing the character data of an XML element in the following SAX callback:
    public void characters(char[] ch, int start, int length) {
        elementText = new String(ch, start, length);
    }
Where elementText is a String.
<client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>
I am loading this XML data into Java objects, and my object's property has this value:
'\n '
Now if I change the text in the <client-key> element above, it comes out fine in my object's property.
Is there some encoding issue that I need to handle somehow?
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("client-key")) {
            client.setClientKey(elementText);
        }
    }
An XML parser typically uses two stages to process the data in a document. In the first stage, the document (which is a sequence of bytes) is decoded into a sequence of characters which are placed in an input buffer. The actual XML parsing is done in a second stage, where the different constructs such as element start and end tags are analyzed. Note that both stages are executed in parallel. More precisely, the input buffer is refilled on demand as the XML parsing progresses. Also note that if the document is already supplied as a character sequence (e.g. using a StringReader), then the decoding in the first stage is skipped, but the parser will still use an input buffer to store the characters read from the stream.

As noted by others, a SAX parser is not required to report a text node as a single chunk. It may at its own discretion decide to split the node into multiple chunks. This is called non-coalescing parsing.
What you call "funny characters" are actually character entity references (&lt; and &gt; in your case). They need to be decoded (to '<' and '>' in your case) before sending the data to the application. However, this can only be done in the second stage. The reason is that the same character sequence (e.g. '&lt;') may not need decoding if it appears in a different context, in particular in a CDATA section.
The point is that if a text node doesn't contain any entity references, then the parser can pass the character data directly from the input buffer to the application. This increases the probability that the entire text node is reported as a single chunk. However, even in that case, it is possible that the text node doesn't fit entirely into the input buffer, in which case the parser will report it in multiple chunks.
On the other hand, if the text node contains entity references, then the parser can't pass the data directly from the input buffer to the application, because part of the data needs further decoding. To avoid copying the data around multiple times, most parsers will choose to pass the parts that don't need further decoding directly to the application, while the entity references are decoded into a separate buffer first. That is the reason why you get chunks that in the original document are delimited by entity references.
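The chunking described above can be observed with a small experiment. The following is a minimal sketch, assuming the JDK's default SAX parser; the class ChunkRecorder and its method chunksOf are hypothetical names introduced for illustration. It records every chunk handed to characters(). How many chunks you see, and where they are split, depends on the parser, so the only thing you can rely on is that the chunks joined together give the full decoded text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every chunk passed to characters().
class ChunkRecorder extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Each call delivers one chunk of a text node, not necessarily all of it.
        chunks.add(new String(ch, start, length));
    }

    // Parses the given XML string and returns the chunks in the order received.
    static List<String> chunksOf(String xml) {
        try {
            ChunkRecorder recorder = new ChunkRecorder();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), recorder);
            return recorder.chunks;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;\n  </client-key>";
        List<String> chunks = chunksOf(xml);
        // Chunk count and boundaries are parser-dependent.
        for (String chunk : chunks) {
            System.out.println("[" + chunk.replace("\n", "\\n") + "]");
        }
        // Joining the chunks yields the complete, decoded text node.
        System.out.println(String.join("", chunks));
    }
}
```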
It works fine, but as the other answer says, the content of the node can arrive in multiple chunks, so you need to append them. The example below shows the output with and without CDATA:
The output:
The last chunk that you receive for the first client-key tag is the newline character followed by some spaces. Since you don't append the chunks, you only keep that last chunk, which is the newline and the spaces.
It works fine with ordinary characters because there is no break in the content, so you may get it all in one chunk.
Same input:
Output:
So either you use CDATA or append.
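To illustrate the two options, here is a minimal sketch, assuming the JDK's default SAX parser; AppendingTextCollector and textOf are hypothetical names, not part of any library. It appends every chunk to a StringBuilder and parses the same value twice, once written with entity references and once inside a CDATA section. Because the chunks are appended, the result does not depend on how the parser splits the text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects all character data by appending chunks.
class AppendingTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // append, never overwrite
    }

    // Parses the XML string and returns the concatenated character data.
    static String textOf(String xml) {
        try {
            AppendingTextCollector collector = new AppendingTextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
            return collector.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>";
        String cdata = "<client-key><![CDATA[#<ABC::DEF::GHI:0x102548f78>]]></client-key>";
        // Both lines print: #<ABC::DEF::GHI:0x102548f78>
        System.out.println(textOf(escaped));
        System.out.println(textOf(cdata));
    }
}
```

Note that a CDATA section only removes the entity references. The parser is still free to deliver the text in several chunks, so appending is the safer of the two options.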
This is probably what you would get if your XML has been tidied to look like:
See ContentHandler
You'd be better off using something like:
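A minimal sketch of such a handler follows. It assumes the elements of interest contain only text, it uses a hypothetical Client class as a stand-in for the one in the question, and the parse helper plus the trim() call (which drops the newline and indentation added by tidying) are illustrative choices rather than requirements. The idea is to clear a StringBuilder in startElement, append in characters, and read the accumulated text in endElement.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the client object used in the question.
class Client {
    private String clientKey;
    void setClientKey(String clientKey) { this.clientKey = clientKey; }
    String getClientKey() { return clientKey; }
}

class ClientHandler extends DefaultHandler {
    final Client client = new Client();
    private final StringBuilder elementText = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Start with an empty buffer for each element (fine for text-only elements).
        elementText.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so accumulate the chunks.
        elementText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("client-key")) {
            // Assumption: surrounding whitespace is not part of the key.
            client.setClientKey(elementText.toString().trim());
        }
    }

    // Illustrative helper: parses an XML string and returns the populated client.
    static Client parse(String xml) {
        try {
            ClientHandler handler = new ClientHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.client;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }
}
```

With this handler, parsing a tidied document whose closing tag sits on its own line stores the full key instead of only the trailing whitespace.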