I'm using the javax.xml.transform.Transformer class to perform some XSLT transformations, like so:
TransformerFactory factory = TransformerFactory.newInstance();
StreamSource source = new StreamSource(TRANSFORMER_PATH);
Transformer transformer = factory.newTransformer(source);
StringWriter extractionWriter = new StringWriter();
String xml = FileUtils.readFileToString(new File(sampleXmlPath));
transformer.transform(new StreamSource(new StringReader(xml)),
                      new StreamResult(extractionWriter));
System.err.println(extractionWriter.toString());
However, no matter what I do I can't seem to avoid having the transformer convert any tabs that were in the source document into their character-entity equivalent (&#9;). I have tried both:
transformer.setParameter("encoding", "UTF-8");
and:
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
but neither of those helps. Does anyone have any suggestions? Because:
&#9;&#9;&#9;&#9;&#9;<MyElement>
looks really stupid (even if it does work).
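For what it's worth, one sanity check I've done is asking the Transformer what it thinks its output settings are, to rule out the property call silently not taking effect. A minimal sketch using the JAXP identity transformer:

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;

public class CheckOutputProps {
    public static void main(String[] args) throws Exception {
        // No-arg newTransformer() gives the identity transform
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // Confirm the property actually took effect on this Transformer
        System.out.println(t.getOutputProperty(OutputKeys.ENCODING));
    }
}
```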
So the answer to this one turned out to be pretty lame: update Xalan. I don't know what was wrong with my old version, but when I switched to the latest version at:
http://xml.apache.org/xalan-j/downloads.html
suddenly the entity-escaping of tabs just went away. Thanks everyone for all your help though.
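For anyone who hits the same symptom: a quick way to check which TransformerFactory implementation is actually being picked up (and therefore whether the new Xalan jar is really winning on your classpath) is to print the concrete factory class. Which name you see depends entirely on your classpath, so treat the output as diagnostic, not gospel:

```java
import javax.xml.transform.TransformerFactory;

public class CheckTransformerImpl {
    public static void main(String[] args) {
        // Prints the concrete factory class, e.g. a Xalan or JDK-internal one,
        // depending on what JAXP resolves from your classpath
        TransformerFactory factory = TransformerFactory.newInstance();
        System.out.println(factory.getClass().getName());
    }
}
```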
You could try using a SAXTransformerFactory in combination with an XMLReader.
Something like:
SAXTransformerFactory transformFactory = (SAXTransformerFactory) TransformerFactory.newInstance();
StreamSource source = new StreamSource(TRANSFORMER_PATH);
StringWriter extractionWriter = new StringWriter();
TransformerHandler transformerHandler = null;
try {
    transformerHandler = transformFactory.newTransformerHandler(source);
    transformerHandler.setResult(new StreamResult(extractionWriter));
} catch (TransformerConfigurationException e) {
    // Chain the original exception so the root cause isn't lost
    throw new SAXException("Unable to create transformerHandler due to transformer configuration exception.", e);
}
XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
reader.setContentHandler(transformerHandler);
// Parse from the file path, not from the XML content string
reader.parse(new InputSource(new FileReader(sampleXmlPath)));
System.err.println(extractionWriter.toString());
You should be able to configure the SAX parser not to pass along ignorable whitespace, if it doesn't do so already by default. I haven't actually tested this, but I do something similar in one of my projects.
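One wrinkle to be aware of: most SAX parsers only report whitespace through ignorableWhitespace() when validating against a DTD; without validation, whitespace-only runs arrive through characters(). If that's your situation, a filter along these lines could sit between the reader and the TransformerHandler. This is a sketch, and note the caveat in the comment: it also drops any whitespace-only text that was actually significant.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Sketch: a SAX filter that swallows character runs consisting only of
// whitespace. Beware: this also strips significant whitespace-only text
// nodes, so only use it when your documents have none.
public class WhitespaceDroppingFilter extends XMLFilterImpl {
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (int i = start; i < start + length; i++) {
            if (!Character.isWhitespace(ch[i])) {
                super.characters(ch, start, length); // pass through real content
                return;
            }
        }
        // whitespace-only run: drop it
    }
}
```

You would install it with reader.setContentHandler(filter) and point the filter's own content handler at the TransformerHandler.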
Sometimes with things like this, replacing them yourself with regex afterwards is not an entirely bad option, which at least gets you going until you find a better option later.
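For example, if the escaped tab is the only offender, a one-method post-processing step could look like this. (restoreTabs is just an illustrative name; it assumes the output contains literal &#9; or &#x9; sequences and that none of them were intentional.)

```java
public class EntityFixup {
    // Replace both the decimal and hex forms of the tab character reference
    // with a real tab. Assumes no intentional &#9; text in the output.
    public static String restoreTabs(String transformed) {
        return transformed.replaceAll("&#x?9;", "\t");
    }

    public static void main(String[] args) {
        System.out.println(restoreTabs("&#9;&#9;<MyElement>"));
    }
}
```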
Is there any reason you are reading the file into a string first instead of using a file stream directly?
Instead of
String xml = FileUtils.readFileToString(new File(sampleXmlPath));
transformer.transform(new StreamSource(new StringReader(xml)),
                      new StreamResult(extractionWriter));
You could try
transformer.transform(new StreamSource(new FileReader(sampleXmlPath)),
                      new StreamResult(extractionWriter));
This may not be the cause of the problem, but I've seen it cause similar problems before. If your FileUtils.readFileToString is the Commons IO version, the single-argument overload reads the file in the platform default encoding (IIRC), which isn't necessarily the UTF-8 you want.
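If you do need the content in a string first, a safer sketch is to read with an explicit charset. This uses plain java.nio rather than Commons IO, but the same idea applies to the readFileToString overload that takes an encoding argument:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadUtf8 {
    // Read a file's contents with an explicit charset
    // instead of relying on the platform default
    static String readUtf8(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }
}
```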