We have a bunch of files that are HTML pages but that also contain additional XML elements (all prefixed with our company name, 'TLA') to provide data and structure for an older program, which I am now rewriting.
Example Form:
<html >
<head>
<title>Highly Simplified Example Form</title>
</head>
<body>
<TLA:document xmlns:TLA="http://www.tla.com">
<TLA:contexts>
<TLA:context id="id_1" value=""></TLA:context>
</TLA:contexts>
<TLA:page>
<TLA:question id="q_id_1">
<table>
<tr>
<td>
<input id="input_id_1" type="text" />
</td>
</tr>
</table>
</TLA:question>
</TLA:page>
<!-- Repeat many times -->
</TLA:document>
</body>
</html>
My task is to write a pre-processor that copies only the HTML elements, complete with their attributes and content, into a new file.
Like this:
<html >
<head>
<title>Highly Simplified Example Form</title>
</head>
<body>
<table>
<tr>
<td>
<input id="input_id_1" type="text" />
</td>
</tr>
</table>
<!-- Repeat many times -->
</body>
</html>
I've taken the approach of using XSLT, since that is what I needed to extract the TLA elements for a different file. So far, this is the XSLT I have:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
xmlns:TLA="http://www.tla.com">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*" />
<xsl:template match="TLA:* | TLA:*/@* | TLA:*/text()"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
However, this only produces the following:
<html >
<head>
<title>Highly Simplified Example Form</title>
</head>
<body>
</body>
</html>
As you can see, everything within the TLA:document element is excluded. What needs to be changed in the XSLT to get all the HTML but filter out the TLA elements?
Alternatively, is there a simpler way to go about this? I know that virtually every browser will ignore the TLA elements, so is there a way to get what I need using an HTML tool or app?
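Your empty template matches TLA:document and outputs nothing for it, so none of its descendants are ever processed. Specifically targeting HTML elements would be hard, but if you just want to exclude content from the TLA namespace (but still include any non-TLA elements that the TLA elements contain), then this should work: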
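A minimal sketch of that approach, assuming the TLA prefix is bound to http://www.tla.com as in the sample form. Instead of an empty template, the TLA template keeps applying templates to its element and comment children. Elements are rebuilt with xsl:element rather than xsl:copy, because xsl:copy would also copy the in-scope TLA namespace declaration onto the HTML elements nested inside TLA:document.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:TLA="http://www.tla.com"
    exclude-result-prefixes="TLA">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Attributes, text, comments and processing instructions are copied as-is -->
  <xsl:template match="@*|text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>

  <!-- Elements are rebuilt by name so no TLA namespace declaration is carried over -->
  <xsl:template match="*">
    <xsl:element name="{name()}" namespace="{namespace-uri()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- TLA elements: emit nothing for the element itself, but keep processing
       its child elements and comments (text directly inside it is dropped) -->
  <xsl:template match="TLA:*">
    <xsl:apply-templates select="*|comment()"/>
  </xsl:template>

  <!-- Assumption: any TLA-prefixed attributes on HTML elements should also go -->
  <xsl:template match="@TLA:*"/>
</xsl:stylesheet>
```

The TLA:* template wins over the generic * template because a prefixed wildcard has a higher default priority (-0.25 versus -0.5), so no explicit priority attribute is needed.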
You can also use this instead if you want to exclude anything that has a non-empty namespace URI, not just the TLA one:
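A sketch of the namespace-agnostic variant, under the assumption that the HTML elements themselves are in no namespace (as in the sample form, where the html element carries no xmlns declaration). It tests namespace-uri() rather than naming a prefix, so the stylesheet does not need to declare TLA at all.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Attributes, text, comments and processing instructions are copied as-is -->
  <xsl:template match="@*|text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>

  <!-- No-namespace elements are rebuilt by name, without inherited declarations -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Any element in a namespace: skip the element itself, but keep
       processing its child elements and comments -->
  <xsl:template match="*[namespace-uri() != '']">
    <xsl:apply-templates select="*|comment()"/>
  </xsl:template>

  <!-- Any attribute in a namespace is dropped -->
  <xsl:template match="@*[namespace-uri() != '']"/>
</xsl:stylesheet>
```

Note that if the pages were XHTML with a default namespace on the html element, this version would strip every element, so it only suits input like the sample.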
When either is run on your sample input, the result is:
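For reference, a stylesheet that skips the namespaced elements but keeps processing their element and comment children should reduce the sample form to roughly the following. The exact indentation, the form of the empty input tag, and whether an XML declaration is emitted first all depend on the XSLT processor.

```xml
<html>
  <head>
    <title>Highly Simplified Example Form</title>
  </head>
  <body>
    <table>
      <tr>
        <td>
          <input id="input_id_1" type="text" />
        </td>
      </tr>
    </table>
    <!-- Repeat many times -->
  </body>
</html>
```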