Best way to “fix” malformed html for use in an xsl

Posted 2019-09-15 04:05

Question:

I have an input XML document that contains malformed HTML which has been XML-encoded, i.e. the XML document itself is technically valid.

I am now applying an XSL transform to the XML. The transform outputs otherwise well-formed XHTML5, but that output contains the malformed HTML.

Examples of the bad html:

  • html, head and body tags in html fragments.
  • font tags
  • mismatched quotes
  • unclosed tags
  • extra close tags with no matching open
  • close tags in the wrong order (e.g. <b><u>text</b></u>)

In my situation I don't actually care that the HTML is malformed. I only care that my own closing tags match my own opening tags, regardless of what goes in between.

So my question is - what is the best way to either

  1. Clean up the html sufficiently that it does not affect other tags (preferably from within the transform itself)
  2. or somehow mark a close tag so that HTML5-compatible browsers recognise it as matching a particular open tag, regardless of whatever nasty markup may be in between.

For 2, I have no ideas at all. For 1, I have a couple of ideas, such as calling an external tool like tidy or using a .NET SGML parser.

.NET xsl scripts (msxsl:script) are acceptable, if undesirable.

Example source:

<xml>
  &lt;b&gt;&lt;u&gt;bad html&lt;/b&gt;&lt;/u&gt;
</xml>

Example output:

<div id="MyDiv">
  <b><u>bad html</b></u>
</div> <!-- this /div absolutely must match the opening div regardless of what might be in the bad html -->

What other approaches are available?

C#, VS2012, xslt 1.0 only

Answer 1:

Is using a third-party library acceptable? The HTML Agility Pack (available on NuGet) might go part of the way to solving your invalid HTML, and it also (according to the website) supports XSLT.
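
For illustration, here is a minimal sketch of that approach in C#, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (HtmlCleaner, ToWellFormedXml) are hypothetical, and how well the parser repairs any particular fragment (overlapping tags, stray close tags) should be checked against real input.

```csharp
using System.IO;
using HtmlAgilityPack;

static class HtmlCleaner
{
    // Hypothetical helper: parses a possibly malformed HTML fragment and
    // re-serializes it as XML so it can be embedded in well-formed output.
    public static string ToWellFormedXml(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;   // ask the library to emit XML rather than HTML
        doc.LoadHtml(html);             // tolerant parse of the input string

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

A call such as HtmlCleaner.ToWellFormedXml("<b><u>bad html</b></u>") returns the re-serialized markup as a string. To use it from an XSLT 1.0 stylesheet, one option is to wrap it in an extension function.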



Answer 2:

I went with an SGML parsing library that converts the HTML to valid XML.

I used MindTouch's library: https://github.com/MindTouch/SGMLReader

Once it was compiled and added to the GAC, I could use this XSL:

<msxsl:script language="C#" implements-prefix="myns">
  <msxsl:assembly name="SgmlReaderDll, Version=1.8.11.0, Culture=neutral, PublicKeyToken=46b2db9ca481831b"/>
  <![CDATA[
  // Parses a (possibly malformed) HTML string with SgmlReader and returns
  // the root element of the resulting well-formed XML document.
  public XPathNodeIterator SGMLStringToXml(string strSGML)
  {
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = new System.IO.StringReader(strSGML);

    // create document
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc.CreateNavigator().Select("/*");
  }

  // Returns the full path of the current working directory.
  public string CurDir()
  {
    return (new System.IO.DirectoryInfo(".")).FullName;
  }
  ]]>
</msxsl:script>

<!-- Copy every node, staying in this mode for the whole subtree. The single
     space appended inside each element stops empty elements from being
     serialized as self-closing tags. -->
<xsl:template match="node()" mode="PreventSelfClosingTags">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()" mode="PreventSelfClosingTags"/>
    <xsl:text> </xsl:text>
  </xsl:copy>
</xsl:template>

<!-- Attributes have no children, so a plain copy is enough. -->
<xsl:template match="@*" mode="PreventSelfClosingTags">
  <xsl:copy/>
</xsl:template>

and use it like so:

<xsl:apply-templates select="myns:SGMLStringToXml(.)/body/*" mode="PreventSelfClosingTags"/>

N.B. You have to run the transform manually with an XslCompiledTransform instance. The asp:xml control doesn't like the DLL reference.
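
A minimal sketch of running the transform by hand, assuming the .NET Framework and hypothetical file names (transform.xsl, input.xml, output.html). Script blocks are disabled by default, so they have to be enabled through XsltSettings before the stylesheet is loaded.

```csharp
using System.Xml;
using System.Xml.Xsl;

static class RunTransform
{
    static void Main()
    {
        // msxsl:script is off by default; enable it explicitly.
        // The document() function is left disabled because it is not needed here.
        var settings = new XsltSettings(enableDocumentFunction: false, enableScript: true);

        var xslt = new XslCompiledTransform();

        // "transform.xsl" is a hypothetical path to the stylesheet shown above.
        xslt.Load("transform.xsl", settings, new XmlUrlResolver());

        // Hypothetical input and output paths; the output is serialized
        // according to the stylesheet's xsl:output settings.
        xslt.Transform("input.xml", "output.html");
    }
}
```

Enabling script lets the stylesheet run arbitrary C#, so this is only appropriate for stylesheets you control.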