Parsing Mixed-Content XML with SAX

Posted 2019-09-12 01:25

Question:

I have a sample mixed-content XML document (structure cannot be modified):

<items>
    <item>  ABC123    <status>UPDATE</status>
    <units>
        <unit Description="Each     ">EA     <saleprice>2.99</saleprice>
            <saleprice2/>
        </unit>
    </units>
    <warehouses>
        <warehouse>100<availability>2987.000</availability>
        </warehouse>
    </warehouses>
    </item>
</items>

I am attempting to use a SAX parser on this XML document, but the mixed-content elements are causing some issues. Namely, I get an empty String when attempting to handle the <item/> node.

My handler:

@Override
public void startElement(final String uri, 
        final String localName, final String qName, final Attributes attributes) throws SAXException {

    final String fixedQName = qName.toLowerCase();
    switch (fixedQName) {
        case "item":
            prod = new Product();
            //prod.setItem(content); <-- doesn't work, content is empty since element just started
            break;
    }

}

@Override
public void endElement(final String uri, final String localName, final String qName) throws SAXException {
    final String fixedQName = qName.toLowerCase();
    switch (fixedQName) {
        case "item":
            prod.setItem(content); // <-- doesn't work either, only returns an empty string
            // end element, set item
            productList.add(prod);
            break;
        case "status":
            prod.setStatus(content);
            break;
        // ... etc....
    }

}

@Override
public void characters(final char[] ch, final int start, final int length) throws SAXException {
    content = "";
    content = String.copyValueOf(ch, start, length).trim();
}

This handler works correctly for everything of interest, except the <item/> element. It always returns an empty string.

If I add a println() to the characters() method to print the content, I can see that the parser does eventually deliver the text of <item/>, but later than I expected (on a subsequent characters() invocation).

Referencing http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html, I know I should aggregate the strings returned from characters(). However, I don't see how to do this, since I also need to retrieve the other elements' data, and hard-coding an exception for the first element into the characters() method seems like the wrong approach.

How can I use SAX to retrieve the mixed-content <item/>'s data 'ABC123'?

Answer 1:

If the item's content consists only of the text before the opening tag of the status element, then you can capture it in startElement when the status element begins:

public void startElement(final String uri, 
    final String localName, final String qName, final Attributes attributes) throws SAXException {

    final String fixedQName = qName.toLowerCase();
    switch (fixedQName) {
         case "item":
             prod = new Product();
             break;
         case "status":
             prod.setItem(content);
             break;
    }
}

To understand why this works, consider the flow of events:

  • startElement item
  • characters "ABC123"
  • startElement status
  • characters "UPDATE"
  • endElement status
  • characters ""
  • endElement item
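
If the text of item cannot be assumed to sit entirely before status, or if the parser splits one text run across several characters() calls (which SAX parsers are allowed to do), a more general option is to keep one text buffer per open element. The following is only a minimal sketch of that idea, not the asker's actual handler: the class names MixedContentDemo and ItemTextHandler and the method parseItemTexts are hypothetical, and the Product class from the question is replaced by a plain list of strings so the example is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MixedContentDemo {

    /** Hypothetical handler: collects the direct text of each item element. */
    static class ItemTextHandler extends DefaultHandler {
        // One buffer per currently open element; the top belongs to the innermost one.
        private final Deque<StringBuilder> buffers = new ArrayDeque<>();
        final List<String> itemTexts = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            // A new element starts, so text from now on belongs to it, not to its parent.
            buffers.push(new StringBuilder());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Append rather than overwrite: one text run may arrive in several chunks.
            if (!buffers.isEmpty()) {
                buffers.peek().append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Popping restores the parent's buffer, which still holds the parent's own text.
            String text = buffers.pop().toString().trim();
            if ("item".equalsIgnoreCase(qName)) {
                itemTexts.add(text);
            }
        }
    }

    static List<String> parseItemTexts(String xml) throws Exception {
        ItemTextHandler handler = new ItemTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.itemTexts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item>  ABC123    <status>UPDATE</status>"
                + "<units><unit Description=\"Each\">EA<saleprice>2.99</saleprice></unit></units>"
                + "</item></items>";
        System.out.println(parseItemTexts(xml)); // prints [ABC123]
    }
}
```

Because each element's text is only read in endElement, the same pattern would also cover status, unit and warehouse: their own text ends up in their own buffer, and the whitespace between child elements is removed by trim().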


Tags: java xml sax