Keep numeric character entity characters such as `

Posted 2020-02-12 10:42

Question:

I am parsing XML in Java that contains numeric character references and predefined entities such as (but not limited to) &#10; &#13; &lt; &gt; (line feed, carriage return, <, >). While parsing, I append the text content of nodes to a StringBuilder so that I can later write it out to a text file.

However, these references are resolved into the characters they stand for (newlines/whitespace) by the time I write the String to a file or print it out.

How can I keep the original character references when iterating over the nodes of an XML file in Java and storing the nodes' text content in a String?

Example demo XML file:

<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">    
    <Field attributeWithChar="A string followed by special symbols &#13;  &#10;" />
</ABCD>

Example Java code: it loads the XML, iterates over the elements, and appends the value of each attribute to a StringBuilder. After the iteration is over, it writes the result to the console and also to a file, but the &#10; and &#13; references no longer appear in the output.

What would be a way to keep these references when storing them in a String? Could you please help me? Thank you.

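import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Class name is illustrative; the original snippet showed only main().
public class Demo {
    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = documentBuilder.parse(new File("path/to/demo.xml"));
        StringBuilder sb = new StringBuilder();

        // "*" matches every element in the document.
        NodeList nodeList = document.getElementsByTagName("*");
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                // Append every attribute value; references are already resolved at this point.
                NamedNodeMap attributes = node.getAttributes();
                for (int j = 0; j < attributes.getLength(); j++) {
                    sb.append(attributes.item(j).getTextContent());
                }
            }
        }
        System.out.println(sb);

        try (Writer writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("path/to/demo_output.xml"), StandardCharsets.UTF_8))) {
            writer.write(sb.toString());
        }
    }
}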
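
To make the behaviour described in the question concrete, here is a minimal sketch that parses an inline copy of the demo file and inspects the attribute value. The class name `ResolvedReferenceDemo`, the constant `DEMO_XML`, and the helper `readAttribute` are illustrative names, not part of the original code. It assumes the default JDK DOM parser, which replaces each character reference with the character it denotes before the application sees the value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ResolvedReferenceDemo {

    // Inline copy of the demo file so the sketch needs no file on disk.
    static final String DEMO_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<ABCD version=\"2\">\n"
            + "    <Field attributeWithChar=\"A string followed by special symbols &#13;  &#10;\" />\n"
            + "</ABCD>";

    // Hypothetical helper: parses the XML and returns the Field attribute as the DOM exposes it.
    static String readAttribute(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        Element field = (Element) document.getElementsByTagName("Field").item(0);
        return field.getAttribute("attributeWithChar");
    }

    public static void main(String[] args) throws Exception {
        String value = readAttribute(DEMO_XML);
        System.out.println(value.contains("&#13;"));   // false: the reference text is gone
        System.out.println(value.indexOf('\r') >= 0);  // true: a real carriage return took its place
        System.out.println(value.indexOf('\n') >= 0);  // true: likewise for the line feed
    }
}
```

In other words, the DOM never stores the text `&#13;`; it stores the character U+000D, so nothing done after parsing can tell how the character was originally written.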

Answer 1:

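You need to escape all the XML entity references before parsing the file into a Document. You do that by escaping the ampersand & itself as its corresponding XML entity, &amp;. The parser then sees &amp;#13; instead of &#13; and hands back the literal text &#13; rather than a carriage return. Something like: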

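DocumentBuilder documentBuilder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();

String xmlContents = new String(Files.readAllBytes(Paths.get("demo.xml")), "UTF-8");

// Escape every ampersand so references such as &#13; reach the DOM as literal text.
// String.replace is a literal replacement, so no regex is involved.
Document document = documentBuilder.parse(
        new InputSource(new StringReader(xmlContents.replace("&", "&amp;"))));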

Output (the leading 2 is the value of the version attribute on ABCD):

2A string followed by special symbols &#13;  &#10;
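
For anyone who wants to try the pre-escaping idea without a file on disk, the following is a minimal self-contained sketch. The class name `PreEscapeDemo` and the helpers `escapeAmpersands` and `readAttribute` are illustrative assumptions, not part of the answer. Note that the blanket replacement also affects &lt;, &gt; and &amp;, which then come back as literal text too; that matches what the question asks for but may not suit every document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class PreEscapeDemo {

    // Hypothetical helper: escapes every ampersand so the parser
    // treats all references as plain text.
    static String escapeAmpersands(String xml) {
        return xml.replace("&", "&amp;");
    }

    // Hypothetical helper: parses the pre-escaped XML and returns the Field attribute.
    static String readAttribute(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(escapeAmpersands(xml))));
        Element field = (Element) document.getElementsByTagName("Field").item(0);
        return field.getAttribute("attributeWithChar");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ABCD version=\"2\">"
                + "<Field attributeWithChar=\"A string followed by special symbols &#13;  &#10;\" />"
                + "</ABCD>";
        System.out.println(readAttribute(xml));
        // prints: A string followed by special symbols &#13;  &#10;
    }
}
```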


Answer 2:

P.S. This complements Ravi Thapliyal's answer; it is not an alternative.

I have the same problem with an XML file exported from an Excel 2003-format spreadsheet. The file stores line breaks in text content as &#10;, along with other numeric character references. However, after reading it with the Java DOM parser, manipulating the content of some elements, and transforming it back to an XML file, I see that all the numeric character references are expanded (e.g. the line break is converted to CRLF) on Windows with J2SE 1.6. Since my goal is to keep the content as unchanged as possible while manipulating some elements (i.e. to retain the numeric character references), Ravi Thapliyal's suggestion seems to be the only working solution.

When writing the XML content back to the file, it is necessary to replace all &amp; with &, right? To do that, I had to pass a StringWriter to the transformer as the StreamResult, obtain the String from it, replace all occurrences, and write the string to the XML file.

TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);

//write into a stringWriter for further processing.
StringWriter stringWriter = new StringWriter();
StreamResult result = new StreamResult(stringWriter);

t.transform(source, result);

// The StringWriter now holds the serialized XML.
String xmlContent = stringWriter.toString();
// Revert "&amp;" back to "&" to retain the numeric character references.
xmlContent = xmlContent.replace("&amp;", "&");

// try-with-resources closes the writer even if write() throws.
try (BufferedWriter wr = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"))) {
    wr.write(xmlContent);
}
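
Putting both answers together, the whole round trip can be sketched in memory as follows. The class name `RoundTripDemo`, the helper `roundTrip`, and the edit of the `version` attribute are hypothetical and stand in for whatever manipulation is actually needed. One assumption to be aware of: the final replacement turns every &amp; back into &, so if the manipulation step itself adds text containing a raw ampersand, the result would no longer be well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class RoundTripDemo {

    static final String DEMO_XML = "<ABCD version=\"2\">"
            + "<Field attributeWithChar=\"A string followed by special symbols &#13;  &#10;\" />"
            + "</ABCD>";

    static String roundTrip(String xml) throws Exception {
        // Step 1: hide the references from the parser by escaping every ampersand.
        String escaped = xml.replace("&", "&amp;");
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(escaped)));

        // Step 2: manipulate the document (hypothetical edit for illustration).
        Element root = document.getDocumentElement();
        root.setAttribute("version", "3");

        // Step 3: serialize into a String; the serializer writes "&" as "&amp;".
        StringWriter out = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(document), new StreamResult(out));

        // Step 4: undo the escaping from step 1 so the references reappear.
        return out.toString().replace("&amp;", "&");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(DEMO_XML));
    }
}
```

The output should contain the attribute value with &#13; and &#10; intact, alongside the edited version attribute.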