XML parsing issue with '&' in element text

Posted 2019-04-19 21:32

I have the following code:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
// inputXml holds the inbound XML as a String
Document document = builder.parse(new InputSource(new StringReader(inputXml)));

And the parse step is throwing:

SAXParseException: The entity name must immediately follow 
                   the '&' in the entity reference

due to the following '&' in my inputXml:

<Line1>Day & Night</Line1>

I'm not in control of the inbound XML. How can I safely/correctly parse this?

Tags: java xml parsing
4 answers
做个烂人
#2 · 2019-04-19 21:38

Quite simply, the input "XML" is not well-formed XML. The ampersand should be escaped as an entity, i.e.:

<Line1>Day &amp; Night</Line1>

Basically, there's no "proper" way to fix this other than telling the XML supplier that they're giving you garbage and getting them to fix it. If you're in some horrible situation where you've just got to deal with it, then the approach you take will likely depend on what range of values you're expected to receive.

If there are no entities in the document at all, a regex replace of & with &amp; before processing would do the trick. But if they're sending some entities correctly, you'd need to exclude these from the matching. And on the rare chance that they actually wanted to send the entity code (i.e. sent &amp; but meant &amp;amp;) you're going to be completely out of luck.
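
A minimal sketch of that regex approach, in Java to match the question. It assumes the only entities the supplier sends correctly are the five predefined XML ones (&amp;, &lt;, &gt;, &quot;, &apos;) plus numeric character references. The class and method names are made up for illustration. It does not look inside CDATA sections or comments, where a bare & is legal and would be escaped by mistake.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class AmpersandEscaper {
    // Matches an '&' that does NOT start a predefined entity (&amp; &lt; &gt;
    // &quot; &apos;) or a numeric character reference (&#38; or &#x26;).
    // Assumption: the input declares no custom entities in a DTD.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    private AmpersandEscaper() {}

    // Returns the input with every bare '&' rewritten as "&amp;".
    static String escapeBareAmpersands(String rawXml) {
        return BARE_AMPERSAND.matcher(rawXml).replaceAll("&amp;");
    }
}
```

Calling AmpersandEscaper.escapeBareAmpersands(inputXml) before builder.parse(...) leaves references that are already escaped alone and rewrites the rest, so AT&T becomes AT&amp;T. A correctly escaped &amp; is indistinguishable from one that was meant as &amp;amp;, so that case stays unfixable.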

But hey - it's the supplier's fault anyway, and if your attempt to fix up invalid input isn't exactly what they wanted, there's a simple thing they can do to address that. :-)

Luminary・发光体
#3 · 2019-04-19 21:39

Your input XML isn't valid XML; unfortunately you can't realistically use an XML parser to parse this.

You'll need to pre-process the text before passing it to an XML parser. You could do a string replace, turning '& ' into '&amp; ', but that won't catch every occurrence of & in the input (for example, one that isn't followed by a space). You may be able to come up with something that does.
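
To illustrate the pre-processing step and the limits of the simple replace, here is a rough sketch. LenientXmlParser and parse are hypothetical names, and the fix-up is deliberately the naive '& ' replacement, so input such as AT&T still fails to parse.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical wrapper: repair the text first, then hand it to the parser.
final class LenientXmlParser {
    private LenientXmlParser() {}

    static Document parse(String rawXml) {
        // Naive fix-up: only an '&' followed by a single space is escaped.
        String repaired = rawXml.replace("& ", "&amp; ");
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(repaired)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Anything the replace missed still surfaces as a parse failure.
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
    }
}
```

With this, LenientXmlParser.parse("<Line1>Day & Night</Line1>") yields a document whose root element text is "Day & Night", while "<a>AT&T</a>" is still rejected because the & there has no space after it.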

Summer. ? 凉城
#4 · 2019-04-19 21:53

I ran the input through the Tidy framework before XML parsing:

final StringWriter errorMessages = new StringWriter();
final String res = new TidyChecker().doCheck(html, errorMessages);
...
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(addRoot(html))));  
...

And everything worked.

啃猪蹄的小仙女
#5 · 2019-04-19 21:56

Is inputXml a String? Then use this:

// The lookahead keeps the whitespace that follows the '&'
inputXml = inputXml.replaceAll("&(?=\\s)", "&amp;");