I wish to replace special characters like &ndash; and &mdash; occurring in an XML document with the corresponding numeric codes, like &#150;, etc.
I have an input XML document containing several special characters:
<?xml version="1.0"?>
<!DOCTYPE BOOK SYSTEM "bookfull.dtd">
<BOOK>
<P>The war was between1890&ndash;1900
<AF>something&mdash;something else</AF>
</P>
</BOOK>
There are several other characters, like &rsquo; for a single quotation mark.
My XSLT code is as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="html" omit-xml-declaration="yes" indent="yes" />
<xsl:strip-space elements="*" />
<xsl:param name="pDest"
select="'file:///d:/LWW_Book_ePub_Transform/Epub_ZipCreation/XSLT_Transform/Output/'" />
<xsl:template match="P">
<html>
<xsl:apply-templates/>
</html>
</xsl:template>
<xsl:template match="AF">
.....
<xsl:apply-templates/>
.....
</xsl:template>
</xsl:stylesheet>
My Java code for parsing is as follows (I am making use of Saxon 9):
package com.xsltprocessor;

import java.io.File;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ParseUsingSAX {

    public ParseUsingSAX() {
    }

    public void parseBookContent(String xsltFile) {
        try {
            File inputXml = new File("D:\\data\\myxml.0f");
            File xslt = new File(xsltFile);
            TransformerFactory factory = TransformerFactory.newInstance();
            // Passing the File (rather than a FileInputStream) gives each source a
            // system ID, so relative references such as "bookfull.dtd" can be resolved.
            Templates template = factory.newTemplates(new StreamSource(xslt));
            Transformer xformer = template.newTransformer();
            Source source = new StreamSource(inputXml);
            // A StreamResult needs a target; standard output is used here.
            StreamResult result = new StreamResult(System.out);
            xformer.transform(source, result);
            System.out.println("DONE");
        } catch (Exception ex) {
            ex.printStackTrace();
            System.out.println("Transformation failed: " + ex.getMessage());
        }
    }
}
I am getting the following output after transformation:
<html>
The war was between1890–1900
</html>
Expected output:
<html>
The war was between1890&#150;1900
</html>
Either the DTD referenced by
<!DOCTYPE BOOK SYSTEM "bookfull.dtd">
declares the entities used (like &ndash;), or it is in error (or I suppose the input XML could have been in error in trying to use an entity it should be able to use).
If it does declare them, then you need to set your XSLT processor to validate the document according to its DTD. (I don't know how to do this in your case, as I know the XSLT part of the problem, but not the specifics of how to use XSLT in Java.)
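On the Java side, one thing that probably matters is the system ID of the input. This is a minimal sketch, assuming the standard JAXP classes the question already uses: a relative reference such as "bookfull.dtd" is resolved against the source's system ID, and a StreamSource built from a bare FileInputStream has none. The class and method names are hypothetical, and whether the DTD is actually read still depends on how the underlying parser is configured.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import javax.xml.transform.stream.StreamSource;

final class SystemIdDemo {

    /** A source built from a bare stream has no system ID (base URI). */
    static String streamOnly(InputStream in) {
        return new StreamSource(in).getSystemId();
    }

    /** A source built from a File records the file's URI as its system ID. */
    static String fromFile(File xml) {
        return new StreamSource(xml).getSystemId();
    }

    /** A stream can still be used if the base URI is supplied explicitly. */
    static String streamWithBase(InputStream in, File xml) {
        return new StreamSource(in, xml.toURI().toString()).getSystemId();
    }

    public static void main(String[] args) {
        // Hypothetical path; the file does not need to exist for this check.
        File xml = new File("book.xml");
        InputStream in = new ByteArrayInputStream(new byte[0]);
        System.out.println(streamOnly(in));          // prints null
        System.out.println(fromFile(xml));           // a file: URI ending in book.xml
        System.out.println(streamWithBase(in, xml)); // the same base URI, set by hand
    }
}
```

In the code from the question, this would mean passing new StreamSource(inputXml) instead of wrapping the file in a FileInputStream.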
If not, you'll have to fix it.
Get a copy of
http://www.w3.org/2003/entities/2007/w3centities-f.ent
(while it would work to just reference that URI itself, the W3C would prefer that you didn't, and you'll also have better performance with a local copy).
Then create your own bookfull.dtd that includes a parameter entity declaration pointing at your copy, followed by a reference to that parameter entity.
Or alternatively, one that includes the contents of that file directly within the DTD.
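A sketch of what such a DTD might look like, wrapped in a small Java check so the effect can be observed. Everything here is illustrative: the class name, the temporary directory, and the three-entity stand-in for w3centities-f.ent (the real file declares far more). The DTD itself is the two lines in the DTD constant: a parameter entity declaration for the local copy, then a reference to it.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class LocalEntityDtd {

    // Stand-in for the downloaded w3centities-f.ent (assumption: a tiny subset).
    static final String ENTITIES =
          "<!ENTITY ndash \"&#x2013;\">\n"
        + "<!ENTITY mdash \"&#x2014;\">\n"
        + "<!ENTITY rsquo \"&#x2019;\">\n";

    // bookfull.dtd: declare a parameter entity for the local copy, then reference it.
    static final String DTD =
          "<!ENTITY % w3centities-f SYSTEM \"w3centities-f.ent\">\n"
        + "%w3centities-f;\n";

    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE BOOK SYSTEM \"bookfull.dtd\">\n"
        + "<BOOK><P>1890&ndash;1900</P></BOOK>\n";

    /** Writes the three files side by side, parses book.xml, returns the root's text. */
    static String parseSample() throws Exception {
        Path dir = Files.createTempDirectory("entity-demo");
        Files.write(dir.resolve("w3centities-f.ent"), ENTITIES.getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("bookfull.dtd"), DTD.getBytes(StandardCharsets.UTF_8));
        File xml = dir.resolve("book.xml").toFile();
        Files.write(xml.toPath(), XML.getBytes(StandardCharsets.UTF_8));
        // Parsing from a File lets the parser resolve the relative DTD path.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The parser has already replaced &ndash; with U+2013 by the time the text is read.
        System.out.println(parseSample().equals("1890\u20131900"));
    }
}
```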
Now in interpreting the input document, the entity references can be resolved. For example, &ndash; in the above is defined as the character reference &#x2013; (U+2013, EN DASH). Or in other words: "whenever &ndash; appears, replace it with &#x2013;".
This happens at the XML parsing step, prior to the XSLT stylesheet being run, so as far as the XSLT is concerned, the content it received contained &#x2013; (the en dash character itself), not &ndash;, and it treats it as such.

Use an xsl:character-map element, which controls output serialization. You also have to declare an xsl:output element with a use-character-maps attribute naming that map as a top-level element, to ensure that the character mapping is used.
As I mentioned in my comments, &ndash; is an HTML named entity that needs to be declared in XSLT. See e.g. this discussion for more detail.
Embedded into the stylesheet you show (this outputs dummy strings "MDASH" and "NDASH" - just for illustration):
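A sketch of how a character map could sit in such a stylesheet, held in a small Java helper so it can be called from code like the question's. The map name "dashes", the class name, and the cut-down P template are assumptions for illustration; the dummy strings follow the description above. The stylesheet is XSLT 2.0, so it needs a 2.0 processor such as Saxon 9; the XSLT processor bundled with the JDK implements XSLT 1.0 and does not support xsl:character-map.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CharacterMapSketch {

    /** XSLT 2.0 stylesheet: the unnamed xsl:output pulls in the "dashes" character map. */
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"2.0\"\n"
        + "    xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">\n"
        + "  <xsl:output method=\"html\" use-character-maps=\"dashes\"/>\n"
        + "  <xsl:character-map name=\"dashes\">\n"
        + "    <!-- replacement strings are written as-is when the result is serialized -->\n"
        + "    <xsl:output-character character=\"&#x2013;\" string=\"NDASH\"/>\n"
        + "    <xsl:output-character character=\"&#x2014;\" string=\"MDASH\"/>\n"
        + "  </xsl:character-map>\n"
        + "  <xsl:template match=\"P\">\n"
        + "    <html><xsl:apply-templates/></html>\n"
        + "  </xsl:template>\n"
        + "</xsl:stylesheet>\n";

    /** Runs the stylesheet over the given XML text using the supplied (XSLT 2.0) factory. */
    static String transform(TransformerFactory factory, String xml) throws TransformerException {
        Transformer t = factory.newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

With Saxon 9 on the classpath, the helper could be called as CharacterMapSketch.transform(new net.sf.saxon.TransformerFactoryImpl(), xml). To get the output the question asks for, the replacement strings would become string="&amp;#150;" and string="&amp;#151;"; the map writes them unescaped, so they should appear as &#150; and &#151;. Those are Windows-1252 style codes; the numeric references for the Unicode code points would be &#8211; and &#8212;.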
Note that this does not have an effect on output produced with xsl:result-document (since you did not show your entire stylesheet). For more info on character-maps please refer to a previous answer of mine and the official recommendation.