XML parsing with SAX | how to handle special chara

Posted 2019-05-15 01:33

Question:

We have a Java application that pulls data from SAP, parses it, and renders it to the users. The data is pulled using the JCO connector.

Recently we ran into this exception:

org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.

So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.

My questions are:

  1. Is there an existing (open source) utility that replaces illegal characters in XML?
  2. If I have to write such a utility myself, how should I handle those characters?
  3. Why is the above exception thrown?

Thank You.

Answer 1:

From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but it is not.

While replacing '&' with '&amp;' can be done with a simple String.replaceAll(...) on the string returned by the toXML() call, other characters can be harder to replace ('<' and '>', for example).
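One caveat with a plain replaceAll on '&' is that it would also rewrite ampersands that already start an entity or a character reference (turning &amp; into &amp;amp;). Below is a minimal sketch that escapes only bare ampersands. The class and method names are hypothetical, the check is a heuristic (text such as "AT&T;" would be mistaken for an entity), and it does not look at CDATA sections or at '<' and '>'.

```java
final class AmpersandEscaper {
    // An '&' that is NOT followed by "name;", "#digits;" or "#xhex;" is treated as bare.
    private static final String BARE_AMP =
        "&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9a-fA-F]+);)";

    // Escapes bare ampersands and leaves existing entities and character references alone.
    static String escapeBareAmpersands(String xml) {
        return xml.replaceAll(BARE_AMP, "&amp;");
    }
}
```

For example, AmpersandEscaper.escapeBareAmpersands("Q&A &amp; R&#38;D") leaves the two existing references untouched and only rewrites the ampersand in "Q&A".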

Regards, Guillaume



Answer 2:

It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.

Alternatively, looking at the character code, &#00, you might be able to get away with a replace all on it with the empty string:

String goodXml = badXml.replaceAll("&#00;", "");
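If other invalid references can show up besides &#00; (for example &#0; or &#x1;), the same idea can be generalized. The following is only a sketch under assumptions: the class and method names are hypothetical, the validity check follows the Char production of XML 1.0, and the replacement is applied to the whole string, so text that looks like a reference inside a CDATA section would be altered too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlCharRefCleaner {
    // Matches hexadecimal (&#x1F;) and decimal (&#31;) numeric character references.
    private static final Pattern CHAR_REF =
        Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes references to characters that XML 1.0 does not allow; keeps all others.
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp;
            try {
                cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                cp = -1; // number too large to be a code point, treat as invalid
            }
            String replacement = isValidXml10Char(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Usage would mirror the one-liner above: String goodXml = XmlCharRefCleaner.stripInvalidCharRefs(badXml);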


Answer 3:

I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.

If I were in your situation, I'd either come up with a bespoke encoding that replaces the characters which are invalid in XML and handle them as special cases in your parsing, or, if possible, replace them with whitespace.
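The whitespace option could look roughly like the sketch below. It is an illustration only: the class and method names are hypothetical, and it handles raw characters in a String (it does not touch character references such as &#00;, which are plain ASCII text until the parser expands them).

```java
final class XmlCharSanitizer {
    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point that XML 1.0 forbids with a single space.
    static String replaceInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // unpaired surrogates come back as-is and are rejected
            i += Character.charCount(cp);
            if (isValidXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
        }
        return out.toString();
    }
}
```

Replacing rather than deleting keeps word boundaries intact, at the cost of losing the information that a special character was there.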

I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.



Answer 4:

You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils (unescapeXml reverses it). See:

http://commons.apache.org/lang/api-2.4/index.html

To read about how XML character references work, search for "numeric character references" on Wikipedia.
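As a small illustration of numeric character references, and of why the parser in the question rejected the input: a reference such as &#65; is expanded to the character it names, while XML 1.0 does not allow a reference to code point 0, so a conforming parser reports a fatal error. The sketch below assumes the JDK's built-in SAX parser; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class CharRefDemo {
    // Returns the text content the parser delivers, or null if the parser
    // rejects the document with a SAXParseException.
    static String tryParse(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return text.toString();
        } catch (SAXParseException e) {
            return null; // not well-formed, e.g. an illegal character reference
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, CharRefDemo.tryParse("<a>&#65;</a>") yields the text "A", whereas CharRefDemo.tryParse("<a>&#00;</a>") is rejected and returns null.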