How to normalize XML on reverse domain name sorting

Posted 2020-05-06 14:09

Question:

I've been working on a Geo application. Over time, the product's XML has grown a bit messy. The problem arises when synchronizing changes across multiple environments (Dev, Test, etc.). I'm trying to find a way to normalize the content so that editing and merging become less cumbersome and development stays productive. I know it sounds crazy, and there's a lot of background, but let me skip the history and jump to the actual issue.

Here's the issue:

  1. Multiple sort orders apply:

    • Sort on the reversed domain name. For example, d.c.b.a should be read as a.b.c.d, and map.google.com as com.google.map, for sorting purposes.
    • When the domain contains a non-alphanumeric character such as *, ?, [, or ], that node should come after the specific one, because its scope is wider.
    • Sort on port and path as the secondary keys.
    • Apply a similar sort order to the tags under the <tgt> element, if present.
  2. Eliminate the <scheme> and <port> tags when their values are generic (http / https for scheme, 80 or 443 for port); otherwise retain them. Also remove them when they have no value, like <scheme/>.
  3. Preserve all other tags and values as-is.
  4. Trivial things: indent with 2 space characters, and keep the actual data without unwanted boilerplate.
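
Rule 2 on its own can be sketched as a pair of empty templates layered on an identity transform. This is only a minimal sketch, assuming the tags to drop are always named scheme and port and that surrounding whitespace in their values should be ignored; it works in XSLT 1.0 and later.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- identity transform: copy everything not matched by a more specific template -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- drop <scheme> when it is empty or holds a generic value (assumed: http, https) -->
  <xsl:template match="scheme[normalize-space() = ''
                              or normalize-space() = 'http'
                              or normalize-space() = 'https']"/>

  <!-- drop <port> when it is empty or holds a default port (assumed: 80, 443) -->
  <xsl:template match="port[normalize-space() = ''
                            or normalize-space() = '80'
                            or normalize-space() = '443']"/>
</xsl:stylesheet>
```

The empty templates win over the identity template because a pattern with a predicate has a higher default priority than node(). A value such as <scheme>tcp</scheme> matches neither predicate, so it is copied unchanged.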

Here's a bit of the problematic XML:

XML

<?xml version='1.0' encoding='UTF-8' ?>
<?tapia chrome-version='2.0' ?>
<mapGeo>
  <a>blah</a>
  <b>blah</b>
  <maps>
    <mapIndividual>
      <src>
        <scheme>https</scheme>
        <domain>photos.yahoo.com</domain>
        <path>somepath</path>
        <query>blah</query>
      </src>
      <loc>C:\var\tmp</loc>
      <x>blah</x>
      <y>blah</y>
    </mapIndividual>
    <mapIndividual>
      <src>
        <scheme>tcp</scheme>
        <domain>map.google.com</domain>
        <port>80</port>
        <path>/value</path>
        <query>blah</query>
      </src>
      <tgt>
        <scheme>https</scheme>
        <domain>map.google.com</domain>
        <port>443</port>
        <path>/value</path>
        <query>blah</query>
      </tgt>
      <x>blah</x>
      <y>blah</y>
    </mapIndividual>
    <mapIndividual>
      <src>
        <scheme>http</scheme>
        <domain>*.c.b.a</domain>
        <path>somepath</path>
        <port>8085</port>
        <query>blah</query>
      </src>
      <tgt>
        <domain>r.q.p</domain>
        <path>somepath</path>
        <query>blah</query>
      </tgt>
      <x>blah</x>
      <y>blah</y>
    </mapIndividual>
    <mapIndividual>
      <src>
        <scheme>http</scheme>
        <domain>d.c.b.a</domain>
        <path>somepath</path>
        <port>8085</port>
        <query>blah</query>
      </src>
      <tgt>
        <domain>r.q.p</domain>
        <path>somepath</path>
        <query>blah</query>
      </tgt>
      <x>blah</x>
      <y>blah</y>
    </mapIndividual>
  </maps>
</mapGeo>

I was able to apply basic sorting on the values as-is, but couldn't figure out how to generate the reversed domain name. I came across XSL extensions, but haven't tried them yet. Here's the beginning of the solution I was working on, which is very basic.

XSL

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="maps">
    <xsl:copy>
      <xsl:apply-templates select="*">
        <xsl:sort select="src/domain" />
        <xsl:sort select="src/port" />
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

Expected Output

<?xml version='1.0' encoding='UTF-8' ?>
<?tapia chrome-version='2.0' ?>
<mapGeo>
  <a>blah</a>
  <b>blah</b>
  <maps>
    <mapIndividual>
      <src>
        <domain>d.c.b.a</domain>
        <path>somepath</path>
        <port>8085</port>
        <query>blah</query>
      </src>
      <tgt>
        <domain>r.q.p</domain>
        <path>somepath</path>
        <query>blah</query>
      </tgt>
      <x>blah</x>
      <y>blah</y>
    </mapIndividual>
    <mapIndividual>
      <src>
        <domain>*.c.b.a</domain>
        <path>somepath</path>
        <port>8085</port>
        <query>blah</query>
      </src>
      <tgt>
        <domain>r.q.p</domain>
        <path>somepath</path>
        <query>blah</query>
      </tgt>
      <x>blah</x>
      <y>blah</y>
    </mapIndividual>
    <mapIndividual>
      <src>
        <scheme>tcp</scheme>
        <domain>map.google.com</domain>
        <path>/value</path>
        <query>blah</query>
      </src>
      <tgt>
        <domain>map.google.com</domain>
        <path>/value</path>
        <query>blah</query>
      </tgt>
      <x>blah</x>
      <y>blah</y>
    </mapIndividual>
    <mapIndividual>
      <src>
        <domain>photos.yahoo.com</domain>
        <path>somepath</path>
        <query>blah</query>
      </src>
      <loc>C:\var\tmp</loc>
      <x>blah</x>
      <y>blah</y>
    </mapIndividual>
  </maps>
</mapGeo>

Note: I'd prefer XSLT 1.0, as that's what the current environment supports. An XSLT 2.0 solution would be a plus.

Update: I figured out how to support XSLT 2.0 and XSLT 3.0, so please ignore my previous note about XSLT 1.0.

Thank you in advance!

Cheers,

Answer 1:

I don't think it's possible to sort in the reverse order you seek in a single pass using XSLT 1.0. Consider the following simplified example:

XML

<root>
    <item>
        <domain>t.q.p</domain>
    </item>
    <item>
        <domain>s.q.p</domain>
    </item>
    <item>
        <domain>photos.yahoo.com</domain>
    </item>
    <item>
        <domain>map.google.com</domain>
    </item>
    <item>
        <domain>aap.google.com</domain>
    </item>
    <item>
        <domain>r.q.p</domain>
    </item>
    <item>
        <domain>*.c.b.a</domain>
    </item>
    <item>
        <domain>d.c.b.a</domain>
    </item>
</root>

XSLT 1.0 (+ EXSLT node-set)

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
extension-element-prefixes="exsl">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="/root">
    <!-- 1st pass -->
    <xsl:variable name="items">
        <xsl:for-each select="item">
            <xsl:copy>
                <xsl:attribute name="sort-string">
                    <xsl:call-template name="reverse-tokens">
                        <xsl:with-param name="text" select="domain"/>
                    </xsl:call-template>
                </xsl:attribute>
                <xsl:copy-of select="@*|node()"/>
            </xsl:copy>
        </xsl:for-each>
    </xsl:variable>
    <!-- output -->
    <xsl:copy>
        <xsl:apply-templates select="exsl:node-set($items)/item">
            <xsl:sort select="@sort-string" data-type="text" order="ascending"/>
        </xsl:apply-templates>
    </xsl:copy>
</xsl:template>

<xsl:template match="@sort-string"/>

<xsl:template name="reverse-tokens">
    <xsl:param name="text"/>
    <xsl:param name="delimiter" select="'.'"/>
    <xsl:variable name="token" select="substring-before(concat($text, $delimiter), $delimiter)"/>
    <xsl:if test="contains($text, $delimiter)">
        <!-- recursive call -->
        <xsl:call-template name="reverse-tokens">
            <xsl:with-param name="text" select="substring-after($text, $delimiter)"/>
        </xsl:call-template>
        <xsl:value-of select="$delimiter"/>
    </xsl:if>
    <xsl:choose>
        <xsl:when test="$token = '*'">
            <xsl:text>zzzz</xsl:text>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$token"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

</xsl:stylesheet>

Result

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <item>
    <domain>d.c.b.a</domain>
  </item>
  <item>
    <domain>*.c.b.a</domain>
  </item>
  <item>
    <domain>aap.google.com</domain>
  </item>
  <item>
    <domain>map.google.com</domain>
  </item>
  <item>
    <domain>photos.yahoo.com</domain>
  </item>
  <item>
    <domain>r.q.p</domain>
  </item>
  <item>
    <domain>s.q.p</domain>
  </item>
  <item>
    <domain>t.q.p</domain>
  </item>
</root>
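
The reverse-tokens template above treats only a token that is exactly * as a wildcard, while the question also lists ?, [ and ]. Below is a minimal sketch of one way to widen that check and to inspect the generated sort strings. Assumptions: the keys and key output elements are hypothetical and exist only for debugging, and the zzzz prefix (borrowed from the stylesheet above) only works as long as no real label sorts after zzzz.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- emit one <key> per domain so the generated sort strings can be inspected -->
  <xsl:template match="/">
    <keys>
      <xsl:for-each select="//domain">
        <key domain="{.}">
          <xsl:call-template name="reverse-tokens">
            <xsl:with-param name="text" select="string(.)"/>
          </xsl:call-template>
        </key>
      </xsl:for-each>
    </keys>
  </xsl:template>

  <xsl:template name="reverse-tokens">
    <xsl:param name="text"/>
    <xsl:param name="delimiter" select="'.'"/>
    <!-- first token of $text, i.e. everything before the first delimiter -->
    <xsl:variable name="token"
                  select="substring-before(concat($text, $delimiter), $delimiter)"/>
    <xsl:if test="contains($text, $delimiter)">
      <!-- emit the remaining tokens first, which reverses the order -->
      <xsl:call-template name="reverse-tokens">
        <xsl:with-param name="text" select="substring-after($text, $delimiter)"/>
        <xsl:with-param name="delimiter" select="$delimiter"/>
      </xsl:call-template>
      <xsl:value-of select="$delimiter"/>
    </xsl:if>
    <xsl:choose>
      <!-- assumption: any token containing *, ?, [ or ] counts as a wildcard -->
      <xsl:when test="translate($token, '*?[]', '') != $token">
        <xsl:value-of select="concat('zzzz', $token)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$token"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Run against the simplified XML above, map.google.com yields com.google.map, d.c.b.a yields a.b.c.d, and *.c.b.a yields a.b.c.zzzz*. Keeping the original token after the prefix means different wildcard patterns still get distinct sort strings.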


Answer 2:

This XSLT 1.0 stylesheet (without extensions)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output indent="yes" />
    <xsl:strip-space elements="*"/>
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="maps">
        <xsl:copy>
            <xsl:apply-templates select="*">
                <xsl:sort 
                    select="translate(src/domain,translate(src/domain,'.',''),'')" 
                    order="descending"/>
                <xsl:sort 
                    select="
                      substring-after(
                        substring-after(
                          substring-after(translate(src/domain,'*','~'),'.'),'.'),'.')"/>
                <xsl:sort 
                    select="
                        substring-after(
                            substring-after(translate(src/domain,'*','~'),'.'),'.')"/>
                <xsl:sort 
                    select="substring-after(translate(src/domain,'*','~'),'.')"/>
                <xsl:sort select="translate(src/domain,'*','~')" />
                <xsl:sort select="src/port" />
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Output

<?xml version="1.0" encoding="UTF-8"?>
<?tapia chrome-version='2.0' ?>
<mapGeo>
   <a>blah</a>
   <b>blah</b>
   <maps>
      <mapIndividual>
         <src>
            <scheme>http</scheme>
            <domain>d.c.b.a</domain>
            <path>somepath</path>
            <port>8085</port>
            <query>blah</query>
         </src>
         <tgt>
            <domain>r.q.p</domain>
            <path>somepath</path>
            <query>blah</query>
         </tgt>
         <x>blah</x>
         <y>blah</y>
      </mapIndividual>
      <mapIndividual>
         <src>
            <scheme>http</scheme>
            <domain>*.c.b.a</domain>
            <path>somepath</path>
            <port>8085</port>
            <query>blah</query>
         </src>
         <tgt>
            <domain>r.q.p</domain>
            <path>somepath</path>
            <query>blah</query>
         </tgt>
         <x>blah</x>
         <y>blah</y>
      </mapIndividual>
      <mapIndividual>
         <src>
            <scheme>tcp</scheme>
            <domain>map.google.com</domain>
            <port>80</port>
            <path>/value</path>
            <query>blah</query>
         </src>
         <tgt>
            <scheme>https</scheme>
            <domain>map.google.com</domain>
            <port>443</port>
            <path>/value</path>
            <query>blah</query>
         </tgt>
         <x>blah</x>
         <y>blah</y>
      </mapIndividual>
      <mapIndividual>
         <src>
            <scheme>https</scheme>
            <domain>photos.yahoo.com</domain>
            <path>somepath</path>
            <query>blah</query>
         </src>
         <loc>C:\var\tmp</loc>
         <x>blah</x>
         <y>blah</y>
      </mapIndividual>
   </maps>
</mapGeo>

Do note: this relies on the fact that . (dot) precedes letters and ~ (tilde) follows them in alphabetical order (at least for the US locale). It also might not scale well, since each additional domain level needs another sort key.

I agree with Martin Honnen's comment: this would be better solved in XSLT 2.0.
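
For reference, a minimal XSLT 2.0 sketch of the sorting part might look like the following. It is an illustration under stated assumptions, not a tested solution for the full requirement list. The f:reverse-domain function and its urn:example:sort-key namespace are hypothetical names. Each mapIndividual is assumed to have at most one src/domain. Wildcard characters are assumed to be limited to *, ?, [ and ].

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:sort-key"
    exclude-result-prefixes="xs f">

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- hypothetical helper: "map.google.com" becomes "com google map";
       wildcard characters become "~", which follows letters in codepoint order -->
  <xsl:function name="f:reverse-domain" as="xs:string">
    <xsl:param name="domain" as="xs:string?"/>
    <xsl:sequence select="string-join(
        reverse(tokenize(translate($domain, '*?[]', '~~~~'), '\.')), ' ')"/>
  </xsl:function>

  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- sort the children of <maps>: reversed domain first, then port, then path -->
  <xsl:template match="maps">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="*">
        <xsl:sort select="f:reverse-domain(src/domain)"
                  collation="http://www.w3.org/2005/xpath-functions/collation/codepoint"/>
        <xsl:sort select="src/port" data-type="number"/>
        <xsl:sort select="src/path"
                  collation="http://www.w3.org/2005/xpath-functions/collation/codepoint"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The labels are joined with a space rather than a dot because a space sorts before every character allowed in a domain label, so a parent domain always groups ahead of its subdomains. The codepoint collation is requested explicitly so that the "~ follows letters" assumption does not depend on the processor's default locale. For the question's sample, the keys are "a b c d", "a b c ~", "com google map" and "com yahoo photos", which sort in that order.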