XML node text is causing issues when it has funny characters

Posted 2019-06-14 20:08

I'm setting the characters inside the xml element in the following event:

 public void characters(char[] ch, int start, int length) {
        elementText = new String(ch, start, length);
    }

Where elementText is a String.

<client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>

I am loading this XML data into Java objects, and my object's property ends up with this value:

 '\n        '

Now if I change the text in the <client-key> element above, it comes out fine in my object's property.

Is there some encoding issue that I need to handle somehow?

public void endElement(String uri, String localName, String qName) {

       if (qName.equals("client-key")) {
            client.setClientKey(elementText);
        }

}

3 answers
beautiful°
#2 · 2019-06-14 20:32

An XML parser typically uses two stages to process the data in a document. In the first stage, the document (which is a sequence of bytes) is decoded into a sequence of characters which are placed in an input buffer. The actual XML parsing is done in a second stage, where the different constructs such as element start and end tags are analyzed. Note that both stages are executed in parallel. More precisely, the input buffer is refilled on demand as the XML parsing progresses. Also note that if the document is already supplied as a character sequence (e.g. using a StringReader), then the decoding in the first stage is skipped, but the parser will still use an input buffer to store the characters read from the stream.

As noted by others, a SAX parser is not required to report a text node as a single chunk. It may at its own discretion decide to split the node into multiple chunks. This is called non-coalescing parsing.

What you call "funny characters" are actually character entity references (&lt; and &gt; in your case). They need to be decoded (to '<' and '>' in your case) before sending the data to the application. However, this can only be done in the second stage. The reason is that the same character sequence (e.g. '&lt;') may not need decoding if it appears in a different context, in particular in a CDATA section.

The point is that if a text node doesn't contain any entity references, then the parser can pass the character data directly from the input buffer to the application. This increases the probability that the entire text node is reported as a single chunk. However, even in that case, it is possible that the text node doesn't fit entirely into the input buffer, in which case the parser will report it in multiple chunks.

On the other hand, if the text node contains entity references, then the parser can't pass the data directly from the input buffer to the application, because part of the data needs further decoding. To avoid copying the data around multiple times, most parsers will choose to pass the parts that don't need further decoding directly to the application, while the entity references are decoded into a separate buffer first. That is the reason why you get chunks that in the original document are delimited by entity references.
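The chunking described above can be observed with a small sketch like the following. It is not part of the original answer: the class name ChunkDemo and the helper collectChunks are made up for illustration, and the XML is parsed from a string rather than a file. How many chunks a given parser reports is implementation-specific, so the only thing the sketch relies on is that the chunks, joined in order, give the decoded text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the original thread.
class ChunkDemo {

    /** Parses the XML and returns every chunk passed to characters(), in order. */
    static List<String> collectChunks(String xml) {
        List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Record each chunk separately instead of overwriting the previous one.
                chunks.add(new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("could not parse XML", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> chunks =
                collectChunks("<client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>");
        // The number of chunks depends on the parser; the joined text does not.
        System.out.println(chunks.size() + " chunk(s): " + chunks);
        System.out.println("joined: " + String.join("", chunks));
    }
}
```

Whatever the chunk count turns out to be on your parser, keeping only the last chunk loses the rest of the text, which is why the handler has to accumulate.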

趁早两清
#3 · 2019-06-14 20:38

It works fine. But as the other answer says, the content of the node arrives in multiple chunks, so you need to append them. The example below shows the output with and without CDATA:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XMLTest {

    public static void main(String[] args) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            DefaultHandler handler = new DefaultHandler() {

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
                }

                @Override
                public void endElement(String uri, String localName, String qName) throws SAXException {
                }

                // Print every chunk on its own line so the splitting is visible.
                @Override
                public void characters(char[] ch, int start, int length) throws SAXException {
                    System.out.println(new String(ch, start, length));
                }
            };
            saxParser.parse("test.xml", handler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

<?xml version="1.0"?>
<company>
    <staff>
        <client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>    
        <client-key><![CDATA[#<ABC::DEF::GHI:0x102548f78>]]></client-key>    
    </staff>
</company>

The output:

#
<
ABC::DEF::GHI:0x102548f78
>


#<ABC::DEF::GHI:0x102548f78> 

The last chunk that you receive for the first client-key tag is the newline character followed by some spaces. Since you don't append the chunks, you end up with only the last one, which is that newline and spaces.

It works fine with ordinary characters because nothing breaks up the content, so you will probably get it in a single chunk.

Sample input:

<client-key>testing</client-key>

Output:

testing

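So either use CDATA or append the chunks. Appending is the safer of the two, since a SAX parser is allowed to split any character data, including CDATA content, into several chunks.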

贼婆χ
#4 · 2019-06-14 20:56

This is probably what you would get if your XML has been tidied to look like:

<client-key>
    #&lt;ABC::DEF::GHI:0x102548f78&gt;
</client-key>

See ContentHandler

characters
...
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; ...

You'd be better off using something like:

public void characters(char[] ch, int start, int length) {
  // Note the +=
  elementText += new String(ch, start, length);
}

public void endElement(String uri, String localName, String qName) {

  if (qName.equals("client-key")) {
    client.setClientKey(elementText);
  }
  elementText = "";
}
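
A variant of the same idea using a StringBuilder, as a minimal sketch rather than a definitive implementation. The class ClientKeyHandler and the helper parseClientKey are hypothetical names introduced here; the sketch also assumes that whitespace around the key is not significant (so it is trimmed) and it resets the buffer in startElement so that indentation between elements does not leak into the next value.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not from the original thread.
class ClientKeyHandler extends DefaultHandler {

    private final StringBuilder elementText = new StringBuilder();
    private String clientKey;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Drop whitespace collected between elements before a new element starts.
        elementText.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append every chunk; the parser may deliver the text in several pieces.
        elementText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("client-key")) {
            // Assumption: leading and trailing whitespace is not part of the key.
            clientKey = elementText.toString().trim();
        }
        elementText.setLength(0);
    }

    String getClientKey() {
        return clientKey;
    }

    /** Parses the XML and returns the text of the last client-key element, or null if none. */
    static String parseClientKey(String xml) {
        ClientKeyHandler handler = new ClientKeyHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.getClientKey();
    }

    public static void main(String[] args) {
        String tidied = "<client-key>\n    #&lt;ABC::DEF::GHI:0x102548f78&gt;\n</client-key>";
        System.out.println(parseClientKey(tidied));
    }
}
```

Compared with += on a String, the StringBuilder avoids creating a new String for every chunk, which matters mostly for long text nodes. If surrounding whitespace is meaningful in your data, leave out the trim() call.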