We have a bunch of files that are HTML pages but which also contain additional XML elements (all prefixed with our company name, 'TLA'). These elements provide data and structure for an older program, which I am now rewriting.
Example Form:
<html>
  <head>
    <title>Highly Simplified Example Form</title>
  </head>
  <body>
    <TLA:document xmlns:TLA="http://www.tla.com">
      <TLA:contexts>
        <TLA:context id="id_1" value=""></TLA:context>
      </TLA:contexts>
      <TLA:page>
        <TLA:question id="q_id_1">
          <table>
            <tr>
              <td>
                <input id="input_id_1" type="text" />
              </td>
            </tr>
          </table>
        </TLA:question>
      </TLA:page>
      <!-- Repeat many times -->
    </TLA:document>
  </body>
</html>
My task is to write a pre-processor that will extract all the 'TLA' elements and ignore the HTML elements.
Desired XML Output:
<?xml version="1.0" encoding="utf-8" ?>
<TLA:document xmlns:TLA="http://www.tla.com">
  <TLA:contexts>
    <TLA:context id="id_1" value=""></TLA:context>
  </TLA:contexts>
  <TLA:page>
    <TLA:question id="q_id_1">
    </TLA:question>
  </TLA:page>
  <!-- Repeat many times -->
</TLA:document>
This should be doable with XSLT but I'm unable to formulate the correct code. This is what I have so far:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt"
    xmlns:tla="http://www.tla.com"
    exclude-result-prefixes="msxsl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="tla:*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
This extracts the elements I want (but not their attributes!), but it also extracts the text, attributes and content of the HTML elements. How can I exclude the HTML elements and their content?
You could try something like this...
XSLT 1.0
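A minimal sketch of one way to do this in XSLT 1.0, not necessarily the only or best one. The stylesheet in the question has a template only for tla:* elements, so everything else falls through to XSLT's built-in rules: attributes and text nodes are written out as plain text, and unmatched elements are simply walked. That would explain both the missing attributes and the stray HTML text. The sketch below overrides those rules. It assumes the namespace URI is exactly http://www.tla.com as in the sample, that the TLA elements carry no text content that needs to be kept, and that only comments sitting directly inside a TLA element should survive.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tla="http://www.tla.com">

  <xsl:output method="xml" indent="yes"/>

  <!-- Copy every element in the TLA namespace, then keep walking
       its attributes and children. -->
  <xsl:template match="tla:*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy attributes and comments whose parent is a TLA element.
       Without this, the built-in rule emits attribute values as text
       and drops comments. -->
  <xsl:template match="tla:*/@* | tla:*/comment()">
    <xsl:copy/>
  </xsl:template>

  <!-- Assumption: no text is wanted in the output, so suppress all
       text nodes (HTML text and any text inside TLA elements). -->
  <xsl:template match="text()"/>

  <!-- HTML elements have no template of their own: the built-in rule
       walks their children without copying them, so TLA elements
       nested inside HTML markup are still found. -->
</xsl:stylesheet>
```

Applied to the example form, this should produce the structure shown under "Desired XML Output", although the exact indentation and the serialization of empty elements depend on the processor. If the TLA elements can contain text that must be kept, the last template would need to be narrowed, for example to text()[not(parent::tla:*)].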
This should do it:
When run on your sample input (once the missing namespace declaration is added), the result is: