I am creating a WordML document from an XML file whose elements sometimes contain HTML-formatted text.
<w:p>
<w:r>
<w:t> html formatted content is in here taken from xml file! </w:t>
</w:r>
</w:p>
This is roughly how my templates are set up. I have a recursive call-template function that does text replacement against the source XML content. When it comes across a "<b>" tag, I output a string in CDATA containing "</w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t>" to close the current run and start a new run with bold formatting enabled. When it gets to a "</b>" tag, it replaces it with the CDATA string "</w:t></w:r><w:r><w:t>".
What I'd like to do is use XSL to close the run tag and start a new run without using CDATA string inserts. Is this possible?
Working with WordML is tricky. One tip when converting arbitrary XML to WordML using XSLT is to not worry about the text runs when processing blocks, but to instead create a template that matches text() nodes directly, and create the text runs there. It turns out that Word doesn't care if you nest text runs, which makes the problem much easier to solve.
<xsl:template match="text()" priority="1">
<w:r>
<w:t>
<xsl:value-of select="."/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:apply-templates select="@*|node()"/>
</xsl:template>
<xsl:template match="para">
<w:p>
<xsl:apply-templates select="text() | *" />
</w:p>
</xsl:template>
<xsl:template match="b">
<w:r>
<w:rPr>
<w:b />
</w:rPr>
<w:t><xsl:apply-templates /></w:t>
</w:r>
</xsl:template>
This avoids the bad XSLT technique of inserting tags directly as escaped text. You'll end up with the bold tag as a nested text run, but as I said, Word couldn't care less. If you use this technique, you'll need to be careful to not apply templates to the empty space between paragraphs, since it will trigger the text template and create an out-of-context run.
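One way to guard against those stray runs, as a minimal sketch: add a higher-priority template that swallows whitespace-only text nodes that are not inside a para. The para element name comes from the templates above; the priority value of 2 is only an assumption, chosen to outrank the priority="1" text template.

```xml
<!-- Sketch: drop whitespace-only text nodes that are not inside a para,
     so the text() template never emits a w:r outside a w:p.
     priority="2" is an assumption chosen to outrank the priority="1" template. -->
<xsl:template match="text()[not(ancestor::para)][not(normalize-space())]" priority="2"/>
```

A top-level xsl:strip-space declaration listing your container elements is another option. Either way, leave whitespace inside para alone, since it may be the only thing separating words around inline elements such as b.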
I can most probably help you if only I understood your problem... Is the html in a CDATA section or is it parsed as part of the input doc (and thus well-formed XML)?
Since you talk about 'text replacement', I'll assume that you treat the 'html formatted content' as a single string (CDATA) and therefore need a recursive call-template function to perform string replacement. The only way to use an XSL matching template for what you're doing now is to make the HTML part of the parsed document (your input document). In that case you could just match the b tag and replace it with the appropriate output (again: this assumes that it can always be parsed as valid XML). Your problem has now shifted: if I understood it correctly, you're trying to close the w:t and w:r elements and then 'reopen' them. As you probably suspect, this is very hard to do nicely in XSLT (you cannot just create an element in template A and then close it in template B). You'll have to start messing with unescaped output etc. to make this happen. I know I've made a lot of assumptions, but here is a small example to help you on your way:
input.xml
<doc xmlns:w="urn:schemas-microsoft-com:office:word">
<w:p>
<w:r>
<w:t>before<b>bold</b>after</w:t>
</w:r>
</w:p>
</doc>
convert_html.xsl
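<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="urn:schemas-microsoft-com:office:word">
<!-- Identity template: copy everything that is not handled below -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- Close the current run before the bold text and reopen a plain run after it -->
<xsl:template match="/doc/w:p/w:r/w:t//b">
<xsl:value-of select="'&lt;/w:t&gt;&lt;/w:r&gt;&lt;w:r&gt;&lt;w:rPr&gt;&lt;w:b/&gt;&lt;/w:rPr&gt;&lt;w:t&gt;'" disable-output-escaping="yes" />
<xsl:apply-templates select="@*|node()"/>
<xsl:value-of select="'&lt;/w:t&gt;&lt;/w:r&gt;&lt;w:r&gt;&lt;w:t&gt;'" disable-output-escaping="yes" />
</xsl:template>
</xsl:stylesheet>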
Now running
xalan input.xml convert_html.xsl
produces
<?xml version="1.0" encoding="UTF-8"?><doc xmlns:w="urn:schemas-microsoft-com:office:word">
<w:p>
<w:r>
<w:t>before</w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r><w:r><w:t>after</w:t>
</w:r>
</w:p>
</doc>
which I guess is what you wanted.
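For comparison, the same split can be sketched without disable-output-escaping by rebuilding the runs instead of closing and reopening them: match the w:r that contains a b, and emit one complete w:r per child node of its w:t. The mode name run is made up for this illustration. The sketch assumes the same input.xml as above, that b is never nested inside other inline tags, and that the original run carries no w:rPr of its own (it would be dropped).

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="urn:schemas-microsoft-com:office:word">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- A run whose text contains a b is replaced by one run per child
       node of its w:t ("run" is a hypothetical mode name). -->
  <xsl:template match="w:r[w:t/b]">
    <xsl:apply-templates select="w:t/node()" mode="run"/>
  </xsl:template>

  <!-- Plain text becomes an unformatted run. -->
  <xsl:template match="text()" mode="run">
    <w:r><w:t><xsl:value-of select="."/></w:t></w:r>
  </xsl:template>

  <!-- b becomes a run with bold enabled; nested markup is flattened to text. -->
  <xsl:template match="b" mode="run">
    <w:r><w:rPr><w:b/></w:rPr><w:t><xsl:value-of select="."/></w:t></w:r>
  </xsl:template>

</xsl:stylesheet>
```

Applied to input.xml, this should yield three sibling runs (before, bold, after) inside the w:p, with only the middle one carrying w:b.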
Hope this helps you somewhat.
From your description, it sounds like you can parse the embedded HTML. If so, simply applying templates should do what you want. The WordML in the output may not be right, but hopefully this will help.
Sample input:
<text>
<para>
Test for paragraph 1
</para>
<para>
Test for <b>paragraph 2</b>
</para>
</text>
Transform:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="http://foo">
<xsl:template match="/">
<w:p>
<w:r>
<xsl:apply-templates/>
</w:r>
</w:p>
</xsl:template>
<xsl:template match="para">
<w:t>
<xsl:apply-templates/>
</w:t>
</xsl:template>
<xsl:template match="b">
<w:rPr>
<w:b/>
</w:rPr>
<xsl:value-of select="."/>
</xsl:template>
</xsl:stylesheet>
Result:
<w:p xmlns:w="http://foo">
<w:r>
<w:t>
Test for paragraph 1
</w:t>
<w:t>
Test for <w:rPr><w:b /></w:rPr>paragraph 2
</w:t>
</w:r>
</w:p>
To complete the HTML-to-WordML conversion, I recommend this edited version of your code:
<xsl:template match="Body"><xsl:apply-templates select="p"/></xsl:template>
<xsl:template match="text()" priority="1"><w:r><w:t><xsl:value-of select="."/></w:t></w:r></xsl:template>
<xsl:template match="@*|node()"><xsl:apply-templates select="@*|node()"/></xsl:template>
<xsl:template match="p"><w:p><xsl:apply-templates select="text() | *" /></w:p></xsl:template>
<xsl:template match="b"><w:r><w:rPr><w:b /></w:rPr><xsl:apply-templates /></w:r></xsl:template>
<xsl:template match="i"><w:r><w:rPr><w:i /></w:rPr><xsl:apply-templates /></w:r></xsl:template>
<xsl:template match="u"><w:r><w:rPr><w:u w:val="single" /></w:rPr><xsl:apply-templates /></w:r></xsl:template>
This assumes your HTML sits somewhere in an XML document, wrapped in a Body tag (the element matched by the first template).
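As a purely illustrative input (the text is invented; only the Body, p, b, i and u element names come from the templates above), a source fragment of roughly this shape is what these templates expect:

```xml
<!-- Hypothetical sample input; the wording is made up for illustration. -->
<Body>
  <p>Plain, <b>bold</b>, <i>italic</i> and <u>underlined</u> text.</p>
</Body>
```

Each p becomes a w:p and each text node its own w:r/w:t. The b, i and u templates wrap the runs of their content in an outer w:r that carries the formatting, which relies on Word tolerating nested runs as described earlier. Because the Body template only selects p children, the whitespace between paragraphs is never processed.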