How can I parse XML that conforms to the 1.1 spec?

Posted 2020-02-11 07:21

I'm trying to parse a String containing XML that conforms to the XML 1.1 spec. The XML contains character references that are not allowed in XML 1.0 but are allowed in XML 1.1 (character references that translate to Unicode characters in the range U+0001–U+001F).

According to the Xerces2 website, the Xerces2 parser supports parsing XML 1.1 documents. However, I cannot figure out how to tell it that the XML we are trying to parse is XML 1.1.

I'm using a DocumentBuilder to parse the XML (something like this):

public Element parseString(String xmlString) {
    try {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder = dbf.newDocumentBuilder();

        InputSource source = new InputSource(new StringReader(xmlString));

        // Throws org.xml.sax.SAXParseException because of the invalid character refs
        Document doc = documentBuilder.parse(source);

        return doc.getDocumentElement();

    } catch (ParserConfigurationException pce) {
        // Handle the error
    } catch (SAXException se) {
        // Handle the error
    } catch (IOException ioe) {
        // Handle the error
    }
    // Reached only when one of the handlers above swallows the error
    return null;
}

I've tried setting the XML header to indicate the XML conforms to the 1.1 spec...

xmlString = "<?xml version=\"1.1\" encoding=\"UTF-8\" ?>" + xmlString;

...but it is still parsed as 1.0 XML (still generates the invalid character reference exceptions).

How can I configure the Xerces parser to parse the XML as XML 1.1? Is there an alternative parser which provides better support for XML 1.1?

2 answers
干净又极端
#2 · 2020-02-11 07:45

I'm not sure how to do this with Xerces, but Woodstox supports XML 1.1 out of the box. While it is primarily a StAX parser, it also implements the SAX API (since version 3.2).
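As a rough sketch of the StAX route: the code below uses only the standard javax.xml.stream API, which picks up whichever StAX implementation is available at runtime (Woodstox if its jar is on the classpath, otherwise the JDK's bundled one). The class name StaxXml11Sketch and the helper readText are made up for illustration. Whether the 1.1 document is accepted depends on the implementation in use, so the sketch reports the outcome instead of assuming it.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of Woodstox or the JDK.
class StaxXml11Sketch {

    // Concatenates all character data in the document.
    // Returns null if the StAX implementation rejects the document.
    static String readText(String xml) {
        // Resolves to Woodstox when it is on the classpath (assumption:
        // no other StAX provider takes precedence).
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String body = "<a>&#x1;</a>";
        String as10 = readText("<?xml version=\"1.0\"?>" + body);
        String as11 = readText("<?xml version=\"1.1\"?>" + body);
        System.out.println("declared 1.0: " + (as10 == null ? "rejected" : "accepted"));
        System.out.println("declared 1.1: " + (as11 == null ? "rejected" : "accepted"));
    }
}
```

If the 1.1 line prints "rejected", the StAX implementation that was picked up is probably not handling XML 1.1, which is a hint to check that Woodstox is really the one being loaded.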

Evening l夕情丶
#3 · 2020-02-11 07:50

See the Xerces features documentation for a list of all the features the parser supports. The two features below may be the ones relevant here.

http://xml.org/sax/features/unicode-normalization-checking

True: Perform Unicode normalization checking (as described in section 2.13 and Appendix B of the XML 1.1 Recommendation) and report normalization errors.

False: Do not report Unicode normalization errors.

http://xml.org/sax/features/xml-1.1

True: The parser supports both XML 1.0 and XML 1.1.
False: The parser supports only XML 1.0.
Access: read-only
Since: Xerces-J 2.7.0
Note: The value of this feature will depend on whether the parser configuration owned by the SAX parser is known to support XML 1.1.
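Since the xml-1.1 feature is read-only, it can only be queried, not switched on. A minimal sketch of how one might check it and then try a 1.1 document through JAXP is below. It assumes whatever parser JAXP resolves at runtime (the JDK's bundled Xerces derivative unless another one is on the classpath); the class name Xml11Check and both helper methods are made up for illustration. Note also that an XML declaration only counts when it is the very first thing in the document, so the sketch builds each test document with exactly one declaration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of Xerces or the JDK.
class Xml11Check {

    static final String XML11_FEATURE = "http://xml.org/sax/features/xml-1.1";

    // Returns "true" or "false" as reported by the parser, or "unknown"
    // if the parser does not recognize the feature at all.
    static String queryXml11Support() {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return String.valueOf(reader.getFeature(XML11_FEATURE));
        } catch (ParserConfigurationException | SAXException e) {
            return "unknown";
        }
    }

    // Parses <a>&#x1;</a> under the given XML version and returns the root
    // element's text, or null if the parser rejects the document.
    static String parseControlCharRef(String version) {
        String xml = "<?xml version=\"" + version + "\"?><a>&#x1;</a>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (SAXException e) {
            return null; // e.g. SAXParseException for an invalid character reference
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("xml-1.1 feature: " + queryXml11Support());
        System.out.println("declared 1.0 accepted: " + (parseControlCharRef("1.0") != null));
        System.out.println("declared 1.1 accepted: " + (parseControlCharRef("1.1") != null));
    }
}
```

If the 1.1 case is accepted here but the real input still fails, it may be worth checking whether the real string already starts with its own declaration or with whitespace before a second declaration is prepended.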
