I want to replace special characters such as &ndash; and &mdash; occurring in an XML document with the corresponding numeric codes, such as &#150;.
I have an input XML document containing several special characters:
<?xml version="1.0"?>
<!DOCTYPE BOOK SYSTEM "bookfull.dtd">
<BOOK>
<P>The war was between1890&ndash;1900
<AF>something&mdash;something else</AF>
</P>
</BOOK>
There are several other characters, such as &rsquo; for the single quotation mark.
My XSLT code is as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="html" omit-xml-declaration="yes" indent="yes" />
<xsl:strip-space elements="*" />
<xsl:param name="pDest"
select="'file:///d:/LWW_Book_ePub_Transform/Epub_ZipCreation/XSLT_Transform/Output/'" />
<xsl:template match="P">
<html>
<xsl:apply-templates/>
</html>
</xsl:template>
<xsl:template match="AF">
.....
<xsl:apply-templates/>
.....
</xsl:template>
</xsl:stylesheet>
My Java code for running the transformation is as follows (I am using Saxon 9):
package com.xsltprocessor;
import java.io.File;
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
public class ParseUsingSAX {
public ParseUsingSAX() {
}
public void parseBookContent(String xsltFile) {
try {
File inputXml = new File("D:\\data\\myxml.0f");
File xslt = new File(xsltFile);
TransformerFactory factory = TransformerFactory.newInstance();
Templates template = factory.newTemplates(new StreamSource(new FileInputStream(xslt)));
Transformer xformer = template.newTransformer();
Source source = new StreamSource(new FileInputStream(inputXml));
// A StreamResult needs a target to write to; System.out is used here for illustration.
StreamResult result = new StreamResult(System.out);
xformer.transform(source,result);
System.out.println("DONE");
}
catch (Exception ex) {
// TODO Auto-generated catch block
ex.printStackTrace();
System.out.println("IO exception: " + ex.getMessage());
}
}
}
The output I get after the transformation is:
<html>
The war was between1890–1900
</html>
Expected output:
<html>
The war was between1890&#150;1900
</html>
Use an xsl:character-map element, which controls how individual characters are written during output serialization.
<xsl:character-map name="dashes">
<xsl:output-character character="&#8211;" string="&amp;#150;"/>
</xsl:character-map>
You also have to declare
<xsl:output use-character-maps="dashes"/>
as a top-level element to ensure that the character mapping is used.
As I mentioned in my comments, &ndash; is an HTML named entity that is not predefined in XML, so it must be declared before it can be used in an XSLT stylesheet. See e.g. this discussion for more detail.
Embedded into the stylesheet you show (this outputs dummy strings "MDASH" and "NDASH" - just for illustration):
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE stylesheet [
<!ENTITY ndash "&#8211;" >
<!ENTITY mdash "&#8212;" >
]>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="html" omit-xml-declaration="yes" indent="yes" />
<xsl:output use-character-maps="dashes"/>
<xsl:strip-space elements="*" />
<xsl:character-map name="dashes">
<xsl:output-character character="&ndash;" string="NDASH"/>
<xsl:output-character character="&mdash;" string="MDASH"/>
</xsl:character-map>
<xsl:param name="pDest"
select="'file:///d:/LWW_Book_ePub_Transform/Epub_ZipCreation/XSLT_Transform/Output/'" />
<xsl:template match="BOOK">
<html>
<xsl:apply-templates/>
</html>
</xsl:template>
<xsl:template match="AF|P">
<xsl:copy>
<xsl:value-of select="."/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
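Note that this may not affect output produced with xsl:result-document, for example when that instruction selects a different named output format; in that case the character map has to be referenced there as well (I mention this because you did not show your entire stylesheet). For more information on character maps, please refer to a previous answer of mine and the official recommendation.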
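To make the effect of the character map more concrete, here is a minimal plain-Java sketch that only mimics what the serializer does with the "dashes" map above. It is not Saxon's implementation; the class name CharacterMapSketch and the method serialize are hypothetical, and the replacement strings are the same dummy values NDASH and MDASH used in the stylesheet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CharacterMapSketch {
    // Hypothetical stand-in for <xsl:character-map name="dashes">.
    static final Map<Character, String> DASHES = new LinkedHashMap<>();
    static {
        DASHES.put('\u2013', "NDASH"); // en dash
        DASHES.put('\u2014', "MDASH"); // em dash
    }

    // Writes each character unchanged unless the map holds a replacement string for it.
    static String serialize(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String mapped = DASHES.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: The war was between1890NDASH1900
        System.out.println(serialize("The war was between1890\u20131900"));
    }
}
```

The point of the sketch is that the substitution happens only when the result is written out: the stylesheet itself still works with the real dash characters.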
Either the DTD referenced by <!DOCTYPE BOOK SYSTEM "bookfull.dtd"> declares the entities used (such as &ndash;), or it is in error (or, I suppose, the input XML is in error for using an entity that is not available to it).
If it does include them, then you need to set your XSLT processor to validate the document according to its DTD. (I don't know how to do this in your case, as I know the XSLT part of the problem, but not the specifics of how to use XSLT in Java).
If not, you'll have to fix it.
Get a copy of http://www.w3.org/2003/entities/2007/w3centities-f.ent
(while it would work to reference that URI directly, the W3C would prefer that you didn't, and a local copy will give you better performance anyway).
Then create your own bookfull.dtd that includes:
<!ENTITY % w3centities-f PUBLIC "-//W3C//ENTITIES Combined Set//EN//XML"
"w3centities-f.ent">
%w3centities-f;
Or alternatively, that includes the contents of that file directly within the DTD.
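On the Java side, one way to point the parser at a local DTD is an EntityResolver. The following is only a sketch under assumptions: it uses the JDK's built-in JAXP parser rather than Saxon's own APIs, the class name LocalDtdDemo and the method parseText are hypothetical, and the two-line LOCAL_DTD string stands in for a real bookfull.dtd that would pull in w3centities-f.ent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LocalDtdDemo {
    // Stand-in for a local bookfull.dtd (assumption: only two entities declared here).
    static final String LOCAL_DTD =
        "<!ENTITY ndash \"&#x2013;\">\n" +
        "<!ENTITY mdash \"&#x2014;\">\n";

    // Parses the XML and returns the text content of the root element.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Serve the DTD from memory whenever the parser asks for bookfull.dtd.
            builder.setEntityResolver((publicId, systemId) ->
                systemId != null && systemId.endsWith("bookfull.dtd")
                    ? new InputSource(new StringReader(LOCAL_DTD))
                    : null);
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE BOOK SYSTEM \"bookfull.dtd\">\n"
            + "<BOOK><P>1890&ndash;1900</P></BOOK>";
        // Compares the parsed text against the same string with a literal en dash.
        System.out.println(parseText(xml).equals("1890\u20131900"));
    }
}
```

In a real setup the resolver would more likely return an InputSource for a DTD file on disk; the in-memory string just keeps the sketch self-contained.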
Now, when the input document is parsed, the entity references can be resolved. For example, &ndash; in that entity file is defined by:
<!ENTITY ndash "&#x02013;" ><!--EN DASH -->
Or, in other words: "whenever &ndash; appears, replace it with –".
This happens at the XML parsing step, before the XSLT stylesheet is run, so as far as the XSLT is concerned, the content it received contained –, not &ndash;, and it treats it as such.
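A quick way to check this is to ask a stylesheet how long the text is. The sketch below is an illustration under assumptions: it uses the JDK's built-in XSLT 1.0 processor rather than Saxon, declares the entity in an internal DTD subset to stay self-contained, and the names ParsedEntityLengthDemo and lengthSeenByXslt are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ParsedEntityLengthDemo {
    // Stylesheet that outputs only the string length of the root P element.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:value-of select=\"string-length(/P)\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over the given XML and returns its text output.
    static String lengthSeenByXslt(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE P [ <!ENTITY ndash \"&#x2013;\"> ]>\n"
            + "<P>1890&ndash;1900</P>";
        // prints 9: four digits, one dash character, four digits
        System.out.println(lengthSeenByXslt(xml));
    }
}
```

If the stylesheet saw the unexpanded reference, the seven characters of &ndash; would be counted instead of a single dash.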