Sort elements of arbitrary XML document recursively

Posted 2019-01-26 13:51


I'm trying to sort and canonicalize some XML documents. The desired end result is that:

  1. every element's children are in alphabetical order
  2. every element's attributes are in alphabetical order
  3. comments are removed
  4. all elements are properly spaced (i.e. "pretty print").

I have achieved all of these goals except #1.

I have been using this answer as my template. Here is what I have so far:

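import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource
import org.apache.xml.security.c14n.Canonicalizer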

// Initialize the security library

// Create some variables

// Get arguments

// Make sure required arguments have been provided

if(!error) {
    // Create some variables
    def ext = fileInName.tokenize('.').last()
    fileOutName = fileOutName ?: "${fileInName.lastIndexOf('.').with {it != -1 ? fileInName[0..<it] : fileInName}}_CANONICALIZED_AND_SORTED.${ext}"
    def fileIn = new File(fileInName)
    def fileOut = new File(fileOutName)
    def xsltFile = new File(xsltName)
    def temp1 = new File("./temp1")
    def temp2 = new File("./temp2")
    def os
    def is

    // Sort the XML attributes, remove comments, and remove extra whitespace
    println "Canonicalizing..."
    Canonicalizer c = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS)
    os = temp1.newOutputStream()

    // Sort the XML elements
    println "Sorting..."
    def factory = TransformerFactory.newInstance()
    is = xsltFile.newInputStream()
    def transformer = factory.newTransformer(new StreamSource(is))
    is = temp1.newInputStream()
    os = temp2.newOutputStream()
    transformer.transform(new StreamSource(is), new StreamResult(os))

    // Write the XML output in "pretty print"
    println "Beautifying..."
    def parser = new XmlParser()
    def printer = new XmlNodePrinter(new IndentPrinter(fileOut.newPrintWriter(), "    ", true))
    printer.print parser.parseText(temp2.getText())

    // Cleanup

    println "Done!"
}

Full script is here.


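<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="node()|@*">
    <xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>
  </xsl:template>
  <xsl:template match="foo">
    <xsl:copy>
      <xsl:apply-templates>
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>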

Sample Input XML:

<foo b="b" a="a" c="c">
    <zxcv c="c" b="b"/>
    <vcxz c="c" b="b"/>
    <baz e="e" d="d"/>
    <fdsa g="g" f="f"/>
    <asdf g="g" f="f"/>
</foo>

Desired Output XML:

<foo a="a" b="b" c="c">
        <asdf f="f" g="g"/>
        <fdsa f="f" g="g"/>
    <baz d="d" e="e"/>
        <vcxz b="b" c="c"/>
        <zxcv b="b" c="c"/>
</foo>

How can I make the transform apply to all elements so all of an element's children will be in alphabetical order?


If you want the transform to apply to all elements, you need a template that matches all elements, as opposed to one that matches only the specific "foo" element:

<xsl:template match="*">

Note that you would also have to change the current template that matches "node()" so that it excludes elements:

 <xsl:template match="node()[not(self::*)]|@*">

Within this template, you will also need code to select the attributes, because your "foo" template at the moment will ignore them (<xsl:apply-templates /> does not select attributes).

Actually, looking at your requirements, items 1 to 3 can all be done with a single XSLT. For example, to remove comments, you could just exclude them from the template that currently matches node():

<xsl:template match="node()[not(self::comment())][not(self::*)]|@*">

Try the following XSLT, which should achieve points 1 to 3:

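<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()[not(self::comment())][not(self::*)]|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*">
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
      <xsl:apply-templates>
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>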

EDIT: The template <xsl:template match="node()[not(self::comment())][not(self::*)]|@*"> can actually be replaced with just <xsl:template match="processing-instruction()|@*"> which may increase readability. This is because "node()" matches elements, text nodes, comments and processing instructions. In your XSLT, elements are picked up by the other template, text nodes by the built-in template, and comments you want to ignore, leaving just processing instructions.
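To try the simplified stylesheet outside the Groovy script, here is a minimal Java sketch that uses the same javax.xml.transform API. The class name SortXmlDemo, the method sortXml, and the sample input are made up for illustration. It also assumes the JDK's built-in XSLT processor; XSLT does not strictly guarantee attribute order in the output, although processors commonly emit attributes in the order they were copied.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SortXmlDemo {
    // Stylesheet from the answer, using the shorter processing-instruction()|@* template.
    // Comments match no template here, so the built-in rule silently drops them.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='processing-instruction()|@*'>"
        + "<xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='*'>"
        + "<xsl:copy>"
        + "<xsl:apply-templates select='@*'><xsl:sort select='name()'/></xsl:apply-templates>"
        + "<xsl:apply-templates><xsl:sort select='name()'/></xsl:apply-templates>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to an XML string and returns the transformed text. */
    static String sortXml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: unsorted attributes, unsorted children, one comment.
        String input = "<foo b='b' a='a'><!-- note -->"
                     + "<zxcv d='d' c='c'>text</zxcv><asdf/></foo>";
        System.out.println(sortXml(input));
    }
}
```

Running main should print the document with the comment gone, asdf ahead of zxcv, and the text content of zxcv intact, since text nodes are still copied by the built-in template.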


For fun, you can also do this programmatically:

def x = '''<foo b="b" a="a" c="c">
    <!-- A comment -->
    <zxcv c="c" b="b"/>
    <vcxz c="c" b="b"/>
    <baz e="e" d="d"/>
    <fdsa g="g" f="f"/>
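    <asdf g="g" f="f"/>
</foo>'''

def order( node ) {
    // Copy the attributes into a key-sorted map, then re-insert them in that order
    [ *:node.attributes() ].sort().with { attr ->
        node.attributes().clear()
        attr.each { node.attributes() << it }
    }
    // Sort the child elements by name, then recurse into each of them
    node.children().sort { it.name() }
                   .each { order( it ) }
    node
}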

def doc = new XmlParser().parseText( x )

println groovy.xml.XmlUtil.serialize( order( doc ) )

If your nodes have text content, then you need to change the code to:

def x = '''<foo b="b" a="a" c="c">
    <!-- A comment -->
    <zxcv c="c" b="b">Some Text</zxcv>
    <vcxz c="c" b="b"/>
    <baz e="e" d="d">Woo</baz>
    <fdsa g="g" f="f"/>
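    <asdf g="g" f="f"/>
</foo>'''

def order( node ) {
    [ *:node.attributes() ].sort().with { attr ->
        node.attributes().clear()
        attr.each { node.attributes() << it }
    }
    // Text content appears in children() as plain Strings, so give it an empty sort key
    node.children().sort { it instanceof Node ? it.name() : '' }
                   .grep( Node )
                   .each { order( it ) }
    node
}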

def doc = new XmlParser().parseText( x )

println groovy.xml.XmlUtil.serialize( order( doc ) )

Which then gives:

<?xml version="1.0" encoding="UTF-8"?><foo a="a" b="b" c="c">
  <baz d="d" e="e">Woo</baz>
    <fdsa f="f" g="g"/>
    <asdf f="f" g="g"/>
    <zxcv b="b" c="c">Some Text</zxcv>
    <vcxz b="b" c="c"/>
</foo>
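For anyone working in plain Java rather than Groovy, the same recursive idea can be sketched with the standard DOM API. This is only an illustration under a few assumptions: the names DomSorter, order and sort are invented here, whitespace-only text is discarded, and any remaining text ends up ahead of the sorted child elements. DOM leaves attribute order unspecified, so this sketch does not try to control it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomSorter {
    /** Recursively reorders child elements by tag name; drops comments and blank text. */
    static void order(Element parent) {
        List<Element> elements = new ArrayList<>();
        List<Node> discard = new ArrayList<>();
        // Collect first, mutate afterwards, so the sibling walk is not disturbed.
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                elements.add((Element) n);
            } else if (n.getNodeType() == Node.COMMENT_NODE
                    || (n.getNodeType() == Node.TEXT_NODE && n.getNodeValue().trim().isEmpty())) {
                discard.add(n);
            }
        }
        discard.forEach(parent::removeChild);
        // List.sort is stable, so same-named siblings keep their relative order.
        elements.sort(Comparator.comparing(Element::getTagName));
        for (Element e : elements) {
            parent.appendChild(e); // appendChild moves an existing child to the end
            order(e);
        }
    }

    /** Parses the XML, sorts it in place, and serializes it without an XML declaration. */
    static String sort(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            order(doc.getDocumentElement());
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not sort XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input with a comment, text content, and one nested level.
        System.out.println(sort("<foo><!-- A comment --><zxcv>Some Text</zxcv>"
                              + "<baz><y/><x/></baz><asdf/></foo>"));
    }
}
```

Appending each element in sorted order is what performs the reordering: DOM moves a node that is already in the tree instead of duplicating it.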

Tags: xml xslt groovy