Remove empty XML tags

2019-01-18 04:23发布

I am looking for a good approach that can remove empty tags from XML efficiently. What do you recommend? Regex? XDocument? XmlTextReader?

For example,

const string original = 
    @"<?xml version=""1.0"" encoding=""utf-16""?>
    <pet>
        <cat>Tom</cat>
        <pig />
        <dog>Puppy</dog>
        <snake></snake>
        <elephant>
            <africanElephant></africanElephant>
            <asianElephant>Biggy</asianElephant>
        </elephant>
        <tiger>
            <tigerWoods></tigerWoods>       
            <americanTiger></americanTiger>
        </tiger>
    </pet>";

Could become:

const string expected = 
    @"<?xml version=""1.0"" encoding=""utf-16""?>
        <pet>
        <cat>Tom</cat>
        <dog>Puppy</dog>        
        <elephant>                                              
            <asianElephant>Biggy</asianElephant>
        </elephant>                                 
    </pet>";

6条回答
放我归山
2楼-- · 2019-01-18 04:57

This is meant to be an improvement on the accepted answer to handle attributes:

XDocument xd = XDocument.Parse(original);
xd.Descendants()
    .Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(a.Value))
            && string.IsNullOrWhiteSpace(e.Value)
            && e.Descendants().SelectMany(c => c.Attributes()).All(ca => ca.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(ca.Value))))
    .Remove();

The idea here is to check that all attributes on an element are also empty before removing it. There is also the case that empty descendants can have non-empty attributes. I inserted a third condition to check that the element has all empty attributes among its descendants. Considering the following document with node8 added:

<root>
  <node />
  <node2 blah='' adf='2'></node2>
  <node3>
    <child />
  </node3>
  <node4></node4>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns='urn://blah' d='a'/>
  <node7 xmlns='urn://blah2' />
  <node8>
     <child2 d='a' />
  </node8>
</root>

This would become:

<root>
  <node2 blah="" adf="2"></node2>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns="urn://blah" d="a" />
  <node8>
    <child2 d='a' />
  </node8>
</root>

The original and improved answer to this question would lose the node2 and node6 and node8 nodes. Checking for e.IsEmpty would work if you only want to strip out nodes like <node />, but it's redunant if you're going for both <node /> and <node></node>. If you also need to remove empty attributes, you could do this:

xd.Descendants().Attributes().Where(a => string.IsNullOrWhiteSpace(a.Value)).Remove();
xd.Descendants()
  .Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration))
            && string.IsNullOrWhiteSpace(e.Value))
  .Remove();

which would give you:

<root>
  <node2 adf="2"></node2>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns="urn://blah" d="a" />
</root>
查看更多
相关推荐>>
3楼-- · 2019-01-18 04:58

Loading your original into an XDocument and using the following code gives your desired output:

var document = XDocument.Parse(original);
document.Descendants()
        .Where(e => e.IsEmpty || String.IsNullOrWhiteSpace(e.Value))
        .Remove();
查看更多
▲ chillily
4楼-- · 2019-01-18 04:58

As always, it depends on your requirements.

Do you know how the empty tag will display? (e.g. <pig />, <pig></pig>, etc.) I usually do not recommend using Regular Expressions (they are really useful but at the same time they are evil). Also considering a string.Replace approach seems to be problematic unless your XML doesn't have a certain structure.

Finally, I would recommend using an XML parser approach (make sure your code is valid XML).

var doc = XDocument.Parse(original);
var emptyElements = from descendant in doc.Descendants()
                    where descendant.IsEmpty || string.IsNullOrWhiteSpace(descendant.Value)
                    select descendant;
emptyElements.Remove();
查看更多
Melony?
5楼-- · 2019-01-18 05:12

XmlTextReader is preferable if we are talking about performance (it provides fast, forward-only access to XML). You can determine if tag is empty using XmlReader.IsEmptyElement property.

XDocument approach which produces desired output:

public static bool IsEmpty(XElement n)
{
    return n.IsEmpty 
        || (string.IsNullOrEmpty(n.Value) 
            && (!n.HasElements || n.Elements().All(IsEmpty)));
}

var doc = XDocument.Parse(original);
var emptyNodes = doc.Descendants().Where(IsEmpty);
foreach (var emptyNode in emptyNodes.ToArray())
{
    emptyNode.Remove();
}
查看更多
▲ chillily
6楼-- · 2019-01-18 05:14

Anything you use will have to pass through the file once at least. If its just a single named tag that you know then regex is your friend otherwise use a stack approach. Start with parent tag and if it has a sub tag place it in stack. If you find an empty tag remove it then once you have gone through child tags and reached the ending tag of what you have on top of stack then pop it and check it as well. If its empty remove it as well. This way you can remove all empty tags including tags with empty children.

If you are after a reg ex expression use this

查看更多
三岁会撩人
7楼-- · 2019-01-18 05:23

XDocument is probably simplest to implement, and will give adequate performance if you know your documents are reasonably small.

XmlTextReader will be faster and use less memory than XDocument when processing very large documents.

Regex is best for handling text rather than XML. It might not handle all edge cases as you would like (e.g. a tag within a CDATA section; a tag with an xmlns attribute), so is probably not a good idea for a general implementation, but may be adequate depending on how much control you have of the input XML.

查看更多
登录 后发表回答