Copy only HTML from mixed XML and HTML

Posted 2019-07-15 17:57

We have a bunch of files that are HTML pages but also contain additional XML elements (all prefixed with our company name, 'TLA') that provide data and structure for an older program, which I am now rewriting.

Example Form:

<html >
<head>
    <title>Highly Simplified Example Form</title>
</head>
<body>
    <TLA:document xmlns:TLA="http://www.tla.com">
        <TLA:contexts>
            <TLA:context id="id_1" value=""></TLA:context>
        </TLA:contexts>
        <TLA:page>
            <TLA:question id="q_id_1">
                <table>
                    <tr>
                        <td>
                            <input id="input_id_1" type="text" />
                        </td>
                    </tr>
                </table>
            </TLA:question>
        </TLA:page>
        <!-- Repeat many times -->
    </TLA:document>
</body>
</html>

My task is to write a pre-processor that copies only the HTML elements, complete with their attributes and content, into a new file.

Like this:

<html >
<head>
    <title>Highly Simplified Example Form</title>
</head>
<body>
    <table>
        <tr>
            <td>
                <input id="input_id_1" type="text" />
            </td>
        </tr>
    </table>
    <!-- Repeat many times -->
</body>
</html>

I've taken the XSLT approach because I already needed XSLT to extract the TLA elements into a different file. So far this is the XSLT I have:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
    xmlns:mbl="http://www.tla.com">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*" />
  <xsl:template match="mbl:* | mbl:*/@* | mbl:*/text()"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>    
</xsl:stylesheet>

However, this produces only the following:

<html >
<head>
    <title>Highly Simplified Example Form</title>
</head>
<body>
</body>
</html>

As you can see, everything within the TLA:document element is excluded. What needs to be changed in the XSLT to get all the HTML but filter out the TLA elements?

Alternatively, is there a simpler way to go about this? I know that virtually every browser will ignore the TLA elements, so is there a way to get what I need using an HTML tool or app?

Tags: html, xslt
1 answer
甜甜的少女心
#2 · 2019-07-15 18:29

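Your empty template matches TLA:document and emits nothing for it, and because it never applies templates to that element's children, the nested HTML is dropped as well. Specifically targeting HTML elements would be hard, but if you just want to exclude content from the TLA namespace (while still including any non-TLA elements that the TLA elements contain), this should work: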

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mbl="http://www.tla.com" exclude-result-prefixes="mbl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*" />

  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- This element-only identity template prevents the 
       TLA namespace declaration from being copied to the output -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <!-- Pass processing on to child elements of TLA elements -->  
  <xsl:template match="mbl:*">
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>

Alternatively, if you want to exclude every element that is in any namespace at all (not just the TLA one), you can use this instead:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mbl="http://www.tla.com" exclude-result-prefixes="mbl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*" />

  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <xsl:template match="*[namespace-uri()]">
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>

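When either is run on your sample input, the result is shown below. Note that the comment inside TLA:document is not carried over, because the TLA template selects only child elements (select="*"); change it to select="node()" if you also want comments and text inside TLA elements to be kept.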

<html>
  <head>
    <title>Highly Simplified Example Form</title>
  </head>
  <body>
    <table>
      <tr>
        <td>
          <input id="input_id_1" type="text" />
        </td>
      </tr>
    </table>
  </body>
</html>
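
On the second part of the question (a simpler, non-XSLT route), here is a minimal sketch. It assumes Python 3 is available and that the TLA tags always carry the `TLA:` prefix. It uses the standard library's `html.parser` to re-emit the page while skipping those tags. `TlaStripper` and `strip_tla` are names made up for this illustration, not part of any library. Because it filters on tag names rather than namespaces, it should also cope with pages that are not well-formed XML.

```python
from html.parser import HTMLParser

# HTMLParser lowercases tag names, so <TLA:page> arrives as "tla:page".
PREFIX = "tla:"


class TlaStripper(HTMLParser):
    """Re-emit the input, minus any tag whose name starts with PREFIX."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if not tag.startswith(PREFIX):
            # get_starttag_text() returns the tag exactly as it appeared.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <input ... />.
        if not tag.startswith(PREFIX):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not tag.startswith(PREFIX):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_tla(html_text):
    """Return html_text with the TLA tags removed and their content kept."""
    parser = TlaStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

To process a file, read it into a string, pass it to `strip_tla`, and write the result out. Unlike the stylesheets above, this keeps comments and text as they are, but it does not re-indent: whitespace that surrounded the removed TLA tags stays in place, and end tags are written in lower case.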