Linq-to-XML XElement.Remove() leaves unwanted whit

2019-04-21 23:02发布

问题:

I have an XDocument that I create from a byte array (received over tcp/ip).

I then search for specific xml nodes (XElements) and after retrieving the value 'pop' it off of the Xdocument by calling XElement.Remove(). After all of my parsing is complete, I want to be able to log the xml that I did not parse (the remaining xml in the XDocument). The problem is that there is extra whitespace that remains when XElement.Remove() is called. I want to know the best way to remove this extra whitespace while preserving the rest of the format in the remaining xml.

Example/Sample Code

If I receive the following xml over the socket:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
</catalog>

And I use the following code to parse this xml and remove a number of the XElements:

private void socket_messageReceived(object sender, MessageReceivedEventArgs e)
{
     XDocument xDoc;
     try
     {
         using (MemoryStream xmlStream = new MemoryStream(e.XmlAsBytes))
         using (XmlTextReader reader = new XmlTextReader(xmlStream))
         {
             xDoc = XDocument.Load(reader);
         }

         XElement Author = xDoc.Root.Descendants("author").FirstOrDefault();
         XElement Title  = xDoc.Root.Descendants("title").FirstOrDefault();
         XElement Genre  = xDoc.Root.Descendants("genre").FirstOrDefault();

         // Do something with Author, Title, and Genre here...

         if (Author != null) Author.Remove();
         if (Title  != null) Title.Remove();
         if (Genre  != null) Genre.Remove();

         LogUnparsedXML(xDoc.ToString());

     }
     catch (Exception ex)
     {
         // Exception Handling here...
     }
}

Then the resulting string of xml sent to the LogUnparsedXML message would be:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">



      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
</catalog>

In this contrived example it may not seem like a big deal, but in my actual application the leftover xml looks pretty sloppy. I have tried using the XDocument.ToString overload that takes a SaveOptions enum to no avail. I have also tried to call xDoc.Save to save out to a file using the SaveOptions enum. I did try experimenting with a few different linq queries that used XElement.Nodes().OfType<XText>() to try to remove the whitespace, but often I ended up taking the whitespace that I wish to preserve along with the whitespace that I am trying to get rid of.

Thanks in advance for assistance.

Joe

回答1:

It's not easy to answer in a portable way, because the solution heavily depends on how XDocument.Load() generates whitespace text nodes (and there are several implementations of LINQ to XML around that might disagree about that subtle detail).

That said, it looks like you're never removing the last child (<description>) from the <book> elements. If that's indeed the case, then we don't have to worry about the indentation of the parent element's closing tag, and we can just remove the element and all its following text nodes until we reach another element. TakeWhile() will do the job.

EDIT: Well, it seems you need to remove the last child after all. Therefore, things will get more complicated. The code below implements the following algorithm:

  • If the element is not the last element of its parent:
    • Remove all following text nodes until we reach the next element.
  • Otherwise:
    • Remove all following text nodes until we find one containing a newline,
    • If that node only contains a newline:
      • Remove that node.
    • Otherwise:
      • Create a new node containing only the whitespace found after the newline,
      • Insert that node after the original node,
      • Remove the original node.
  • Remove the element itself.

The resulting code is:

public static void RemoveWithNextWhitespace(this XElement element)
{
    IEnumerable<XText> textNodes
        = element.NodesAfterSelf()
                 .TakeWhile(node => node is XText).Cast<XText>();
    if (element.ElementsAfterSelf().Any()) {
        // Easy case, remove following text nodes.
        textNodes.ToList().ForEach(node => node.Remove());
    } else {
        // Remove trailing whitespace.
        textNodes.TakeWhile(text => !text.Value.Contains("\n"))
                 .ToList().ForEach(text => text.Remove());
        // Fetch text node containing newline, if any.
        XText newLineTextNode
            = element.NodesAfterSelf().OfType<XText>().FirstOrDefault();
        if (newLineTextNode != null) {
            string value = newLineTextNode.Value;
            if (value.Length > 1) {
                // Composite text node, trim until newline (inclusive).
                newLineTextNode.AddAfterSelf(
                    new XText(value.SubString(value.IndexOf('\n') + 1)));
            }
            // Remove original node.
            newLineTextNode.Remove();
        }
    }
    element.Remove();
}

From there, you can do:

if (Author != null) Author.RemoveWithNextWhitespace();
if (Title  != null) Title.RemoveWithNextWhitespace();
if (Genre  != null) Genre.RemoveWithNextWhitespace();

Though I would suggest you replace the above with something like a loop fed from an array or a params method call , to avoid code redundancy.