I have two XML files (XSD) which are generated by some tool.
The tool doesn't preserve the order of elements, so although the content is the same, comparing the files as text reports them as different.
Is there a tool that can sort the elements before comparing, so that the documents can be compared as text?
Of course the sorting needs to be done recursively.
Data example:
File A:
<xml>
<A/>
<B/>
</xml>
File B:
<xml>
<B/>
<A/>
</xml>
I had a similar problem and I eventually found: http://superuser.com/questions/79920/how-can-i-diff-two-xml-files
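That post suggests canonicalizing the XML and then running a diff. Note that canonicalization normalizes things like attribute order and namespace declarations but keeps sibling elements in document order, so on its own it only helps when the differences are of that kind. The following should work for you if you are on Linux, Mac, or if you have Windows with something like Cygwin installed: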
$ xmllint --c14n FileA.xml > 1.xml
$ xmllint --c14n FileB.xml > 2.xml
$ diff 1.xml 2.xml
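For the element reordering itself, a small script can sort siblings recursively before the diff. The following is only a minimal sketch using Python's standard xml.etree.ElementTree; the function names are made up for illustration, and it assumes the documents have no mixed content (text interleaved with child elements) and that comments can be ignored.

```python
import xml.etree.ElementTree as ET


def sort_key(elem):
    # Compare siblings by tag first, then by their full serialized subtree.
    return (elem.tag, ET.tostring(elem, encoding="unicode"))


def sort_recursively(elem):
    # Drop whitespace-only text and tails so indentation does not matter.
    if elem.text is not None and not elem.text.strip():
        elem.text = None
    if elem.tail is not None and not elem.tail.strip():
        elem.tail = None
    # Put attributes in name order so serialization is deterministic.
    ordered = dict(sorted(elem.attrib.items()))
    elem.attrib.clear()
    elem.attrib.update(ordered)
    # Sort the children's own subtrees before ordering the children.
    for child in elem:
        sort_recursively(child)
    elem[:] = sorted(elem, key=sort_key)


def normalized(xml_text):
    # Parse, sort every level, and serialize back to a string.
    root = ET.fromstring(xml_text)
    sort_recursively(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    file_a = "<xml>\n<A/>\n<B/>\n</xml>"
    file_b = "<xml>\n<B/>\n<A/>\n</xml>"
    print(normalized(file_a) == normalized(file_b))
```

Writing the normalized strings to two files and running diff on them then shows only the differences that are not due to ordering.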
Have a look at "Using XSLT to Assist Regression Testing", which describes a solution using XSLT.
The XML samples are fundamentally different. Even though the content and the hierarchy may be identical, the relationships between peers are different. When XML is parsed, it is parsed into a structure called a DOM, where the relationships between units are very important. If you want to disregard the relationships between peer entities, you will likely need custom software. I recommend finding a simple open-source XML-aware diff tool and adding the additional requirements that you need. I wrote one at http://prettydiff.com/ but I suggest you look around to see what is available before making a decision, because editing somebody else's algorithms may require a bit of heavy lifting.
You can use the Perl module XML::DifferenceMarkup (http://metacpan.org/pod/XML::DifferenceMarkup) or the xmldiff extension for PHP (pecl.php.net/xmldiff). Both will produce a human-readable XML diff document.
For what it's worth, I have created a Java tool (written in Kotlin, actually) for efficient and configurable canonicalization of XML files.
It will always:
- Sort nodes and attributes by name.
- Remove namespaces (yes - it could - hypothetically - be a problem).
- Prettyprint the result.
In addition you can tell it to:
- Remove a given list of node names - maybe you do not want to know that the value of a piece of metadata, say <RequestReceivedTimestamp>, has changed.
- Sort a given list of collections in the context of the parent - maybe you do not care that the order of <Contact> entries in <ListOfFavourites> has changed.
It uses XSLT and does all the above efficiently using chaining.
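To illustrate two of the steps listed above (removing namespaces and dropping nodes you do not care about), here is a rough Python sketch. It is not how the tool is implemented (the tool uses XSLT); the function names are hypothetical and it relies only on the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def local_name(name):
    # ElementTree spells qualified names as "{namespace-uri}local".
    return name.split("}", 1)[-1]


def strip_namespaces(root):
    # Rewrite every tag and attribute name to its local part.
    for elem in root.iter():
        elem.tag = local_name(elem.tag)
        renamed = {local_name(k): v for k, v in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(renamed)


def drop_elements(root, ignored_names):
    # Snapshot the traversal first, because children are removed while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in ignored_names:
                parent.remove(child)


if __name__ == "__main__":
    doc = (
        '<r xmlns:n="urn:example">'
        "<n:Request>"
        "<n:RequestReceivedTimestamp>1</n:RequestReceivedTimestamp>"
        "<n:Id>7</n:Id>"
        "</n:Request>"
        "</r>"
    )
    root = ET.fromstring(doc)
    strip_namespaces(root)
    drop_elements(root, {"RequestReceivedTimestamp"})
    print(ET.tostring(root, encoding="unicode"))
```

As the list above already warns, stripping namespaces can make two genuinely different elements look the same, so this is only safe when local names are unique in your documents.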
Limitations
It does support sorting nested lists, sorting innermost lists before outer ones, but it cannot reliably sort arbitrary levels of recursively nested lists.
If you have such needs, you can, after having used this tool, compare the sorted byte arrays of the results. They will be equal if only list-sorting issues remain.
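The sorted-bytes comparison can be sketched in a few lines of Python (the function name is made up for illustration). Keep in mind that it is a coarse check: equal sorted bytes are consistent with a pure ordering difference, but they do not prove that the documents are equivalent.

```python
def same_bytes_in_any_order(a: bytes, b: bytes) -> bool:
    # Sorting discards position, so only the multiset of bytes is compared.
    return sorted(a) == sorted(b)


if __name__ == "__main__":
    first = b"<list><item>1</item><item>2</item></list>"
    second = b"<list><item>2</item><item>1</item></list>"
    print(same_bytes_in_any_order(first, second))
```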
Where to get it
You can get it here: XMLNormalize