I am looking for a good approach that can remove empty tags from XML efficiently. What do you recommend? Regex? XDocument? XmlTextReader?
For example,
const string original =
@"<?xml version=""1.0"" encoding=""utf-16""?>
<pet>
<cat>Tom</cat>
<pig />
<dog>Puppy</dog>
<snake></snake>
<elephant>
<africanElephant></africanElephant>
<asianElephant>Biggy</asianElephant>
</elephant>
<tiger>
<tigerWoods></tigerWoods>
<americanTiger></americanTiger>
</tiger>
</pet>";
Could become:
const string expected =
@"<?xml version=""1.0"" encoding=""utf-16""?>
<pet>
<cat>Tom</cat>
<dog>Puppy</dog>
<elephant>
<asianElephant>Biggy</asianElephant>
</elephant>
</pet>";
This is meant to be an improvement on the accepted answer to handle attributes:
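A minimal sketch of such a check, assuming LINQ to XML. The class and method names are hypothetical, and this is not necessarily the exact code the answer used:

```csharp
using System.Linq;
using System.Xml.Linq;

public static class AttributeAwareCleaner
{
    // Hypothetical helper: true when every attribute value on the element is blank
    // (also true when the element has no attributes at all).
    private static bool HasOnlyEmptyAttributes(XElement e) =>
        e.Attributes().All(a => string.IsNullOrWhiteSpace(a.Value));

    public static XDocument RemoveEmptyElements(XDocument doc)
    {
        doc.Descendants()
           .Where(e => HasOnlyEmptyAttributes(e)                      // 1. no meaningful attributes
                    && string.IsNullOrWhiteSpace(e.Value)             // 2. no text content
                    && e.Descendants().All(HasOnlyEmptyAttributes))   // 3. no descendant carries a non-empty attribute
           .Remove();                                                 // Remove() snapshots the query before deleting
        return doc;
    }
}
```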
The idea here is to check that all attributes on an element are also empty before removing it. An empty descendant can still carry non-empty attributes, so I added a third condition that checks that every attribute on the element's descendants is empty as well. Consider the following document with node8 added:
This would become:
The original and improved answer to this question would lose the node2, node6, and node8 nodes. Checking for e.IsEmpty would work if you only want to strip out nodes like <node />, but it's redundant if you're going for both <node /> and <node></node>. If you also need to remove empty attributes, you could do this:
which would give you:
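For the empty-attribute case, a minimal sketch assuming LINQ to XML might look like the following. The class and method names are hypothetical. It first drops blank attributes, then drops any element that has no text and no remaining attributes anywhere in its subtree:

```csharp
using System.Linq;
using System.Xml.Linq;

public static class EmptyAttributeCleaner
{
    public static XDocument StripEmptyAttributesAndElements(XDocument doc)
    {
        // Step 1: remove attributes whose value is empty or whitespace.
        doc.Descendants().Attributes()
           .Where(a => string.IsNullOrWhiteSpace(a.Value))
           .Remove();

        // Step 2: remove elements with no text content and no attributes
        // left on themselves or on any descendant.
        doc.Descendants()
           .Where(e => string.IsNullOrWhiteSpace(e.Value)
                    && !e.DescendantsAndSelf().Any(d => d.HasAttributes))
           .Remove();

        return doc;
    }
}
```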
Loading your original into an XDocument and using the following code gives your desired output:
As always, it depends on your requirements.
Do you know how the empty tag will display? (e.g. <pig />, <pig></pig>, etc.) I usually do not recommend using regular expressions (they are really useful but at the same time they are evil). A string.Replace approach also seems problematic unless your XML has a fixed, known structure. Finally, I would recommend using an XML parser approach (make sure your input is valid XML).
XmlTextReader is preferable if we are talking about performance (it provides fast, forward-only access to XML). You can determine whether a tag is empty using the XmlReader.IsEmptyElement property.
XDocument approach which produces desired output:
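A minimal sketch of that approach, assuming the input fits in memory. The class and method names are hypothetical:

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XmlCleaner
{
    public static string RemoveEmptyElements(string xml)
    {
        // Parse without preserving whitespace, so indentation between tags is not kept as text.
        XDocument doc = XDocument.Parse(xml);

        // XElement.Value concatenates all descendant text, so an element whose
        // children are all empty (like <tiger> in the question) is matched too.
        doc.Descendants()
           .Where(e => string.IsNullOrWhiteSpace(e.Value))
           .Remove();

        // Note: XDocument.ToString() does not emit the XML declaration.
        return doc.ToString();
    }
}
```

One caveat with this sketch: an element that contains only a comment or processing instruction has an empty Value and would be removed as well.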
Anything you use will have to pass through the file at least once. If it's just a single named tag that you know, then regex is your friend; otherwise use a stack approach. Start with the parent tag and, if it has a sub tag, place it on the stack. If you find an empty tag, remove it. Once you have gone through the child tags and reached the ending tag of whatever is on top of the stack, pop it and check it as well; if it's empty, remove it too. This way you can remove all empty tags, including tags with empty children.
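The stack approach above can be sketched as follows. This is one possible reading of it, built on an in-memory XElement tree rather than raw text, and the class and method names are hypothetical. Each element is pushed twice: once to schedule its children, and once more so that it is checked only after its children have been handled:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class StackPruner
{
    public static void Prune(XElement root)
    {
        // Each entry holds an element and whether its children were already scheduled.
        var stack = new Stack<(XElement Element, bool ChildrenVisited)>();
        stack.Push((root, false));

        while (stack.Count > 0)
        {
            var (element, childrenVisited) = stack.Pop();

            if (!childrenVisited)
            {
                // Revisit this element after all of its children have been processed.
                stack.Push((element, true));
                foreach (XElement child in element.Elements().ToList())
                    stack.Push((child, false));
            }
            else if (element != root
                     && !element.HasElements
                     && string.IsNullOrWhiteSpace(element.Value))
            {
                // Children are done; if nothing is left inside, drop the element.
                // Assumption: attributes are ignored, and the root is always kept.
                element.Remove();
            }
        }
    }
}
```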
If you are after a regular expression, use this:
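The pattern from this answer is not shown, so the following is only an illustrative sketch under narrow assumptions: tags have no attributes or namespace prefixes, and the input contains no CDATA sections or comments. The pattern and the class name are hypothetical. The replacement is repeated until nothing changes, so parents that become empty are removed on a later pass:

```csharp
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // Matches <name/> (optionally with whitespace before the slash),
    // or <name></name> with only whitespace in between.
    private static readonly Regex EmptyTag =
        new Regex(@"<(\w+)\s*/>|<(\w+)\s*>\s*</\2\s*>");

    public static string RemoveEmptyTags(string xml)
    {
        string previous;
        do
        {
            previous = xml;
            xml = EmptyTag.Replace(xml, string.Empty);
        } while (xml != previous); // stop once a pass removes nothing

        return xml;
    }
}
```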
XDocument is probably simplest to implement, and will give adequate performance if you know your documents are reasonably small.
XmlTextReader will be faster and use less memory than XDocument when processing very large documents.
Regex is best for handling text rather than XML. It might not handle all edge cases as you would like (e.g. a tag within a CDATA section; a tag with an xmlns attribute), so it is probably not a good idea for a general implementation, but it may be adequate depending on how much control you have over the input XML.
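To illustrate the streaming option, here is a small sketch that only detects empty elements with XmlReader rather than removing them. The class and method names are hypothetical. Note that IsEmptyElement is true only for the self-closing form (<pig />), not for <snake></snake>, so a streaming remover would need some look-ahead on top of this:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class EmptyElementScanner
{
    public static List<string> SelfClosedNames(string xml)
    {
        var names = new List<string>();

        // Forward-only read: the document is never loaded into a tree.
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.IsEmptyElement)
                    names.Add(reader.Name);
            }
        }

        return names;
    }
}
```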