Using Java, I would like to take a document in the following format:
<tag1>
<tag2>
<![CDATA[ Some data ]]>
</tag2>
</tag1>
and convert it to:
<tag1><tag2><![CDATA[ Some data ]]></tag2></tag1>
I tried the following, but it isn't giving me the result I am expecting:
DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
dbfac.setIgnoringElementContentWhitespace(true);
DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
Document doc = docBuilder.parse(new FileInputStream("/tmp/test.xml"));
Writer out = new StringWriter();
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "no");
tf.transform(new DOMSource(doc), new StreamResult(out));
System.out.println(out.toString());
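As documented in an answer to another question, the relevant function would be DocumentBuilderFactory.setIgnoringElementContentWhitespace(), but, as already pointed out here, that setting only takes effect when the parser is in validating mode. The parser needs a grammar for the document (a DTD, for example) to know which whitespace sits in element-only content and can safely be ignored. Without one, the whitespace-only text nodes stay in the tree.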
Therefore, your best bet is to iterate through the Document you get from the parser, and remove all nodes of type TEXT_NODE (or those TEXT_NODEs which contain only whitespace).
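A minimal sketch of that idea follows. The class and method names (BlankTextRemover, removeBlankTextNodes) are made up for illustration and are not part of any library. It only touches nodes of type TEXT_NODE, so the CDATA section from the question is left alone.

```java
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class BlankTextRemover {

    /**
     * Removes every whitespace-only TEXT_NODE below the given node
     * and returns how many nodes were removed.
     */
    public static int removeBlankTextNodes(Node parent) {
        int removed = 0;
        NodeList children = parent.getChildNodes();
        // The NodeList is live, so walk backwards: removing an entry
        // then never shifts the indices that are still to be visited.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // trim() strips space, tab, CR and LF, i.e. all XML whitespace
                if (child.getNodeValue().trim().isEmpty()) {
                    parent.removeChild(child);
                    removed++;
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                removed += removeBlankTextNodes(child);
            }
        }
        return removed;
    }
}
```

In the code from the question, this would be called as BlankTextRemover.removeBlankTextNodes(doc) between docBuilder.parse(...) and tf.transform(...).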
Working solution following instructions in the question's comments by @Luiggi Mendoza.
Try this code. The read and write methods in FileStream ignore whitespace and indents.
Recursively traverse the document. Remove any text nodes with blank content. Trim any text nodes with non-blank content.
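The recursive traversal described above could look roughly like the sketch below. XmlCompactor, trimWhitespace and compact are hypothetical names chosen for this example. It assumes that only plain TEXT_NODEs should be trimmed (CDATA sections are left as they are) and it sets OMIT_XML_DECLARATION so that only the elements are printed. Note that trimming non-blank text changes the content of mixed-content elements, which may or may not be what you want.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XmlCompactor {

    /** Removes blank text nodes and trims non-blank ones, recursively. */
    static void trimWhitespace(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first; child may be removed below.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                String trimmed = child.getNodeValue().trim();
                if (trimmed.isEmpty()) {
                    node.removeChild(child);
                } else {
                    child.setNodeValue(trimmed);
                }
            } else {
                // Elements are descended into; CDATA and comments have no children.
                trimWhitespace(child);
            }
            child = next;
        }
    }

    /** Parses the XML string, strips the whitespace and serializes it again. */
    static String compact(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        trimWhitespace(doc);

        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.INDENT, "no");
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tag1>\n<tag2>\n<![CDATA[ Some data ]]>\n</tag2>\n</tag1>";
        System.out.println(compact(xml));
    }
}
```

Running main on the sample from the question should give the single-line form the question asks for, with the CDATA section intact.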