How to remove href tag from CDATA

Posted 2019-07-03 04:15

Question:

I have the following CDATA inside an XML document:

<![CDATA[ <p xmlns="">Refer to the below: <br/>
</p>
<table xmlns:abc="http://google.com pic.xsd" cellspacing="1" class="c" type="custom" width="100%">
    <tbody>
        <tr xmlns="">            
            <th style="text-align: left">Basic offers...</th>
        </tr>
        <tr xmlns="">
            <td style="text-align: left">Faster network</td>
            <td style="text-align: left">
            <ul>                
                <li>Session</li>
            </ul>
            </td>
        </tr>
        <tr xmlns="">
            <td style="text-align: left">capabilities</td>
            <td style="text-align: left">
            <ul>                
                <li>Navigation,</li>
                <li>message, and</li>
                <li>contacts</li>
            </ul>
            </td>
        </tr>
        <tr xmlns="">
            <td style="text-align: left">Data</td>
            <td style="text-align: left">
            <p>Here visit google for more info <a href="http://www.google.com" target="_blank"><font color="#0033cc">www.google.com</font></a>.</p>
            <p>Remove this href tag <a href="/abc/def/{T}/t/1" target="_blank">Information</a> remove the tag.</p>
            </td>
        </tr>
    </tbody>
</table>
<p xmlns=""><br/>
</p>
  ]]> 

I want to somehow scan for href="/abc/def and remove every anchor tag whose href starts with /abc/def. In the example above, that means removing the <a> tag and leaving just the text "Information" that was inside it. The CDATA can contain more than one anchor with an href starting with /abc/def. I am using C# for this application. Can someone tell me how this can be done? Should I use a regex, or is there a way to do it with XML itself?

This is the regex I am trying:

"<a href=\"/abc/def/.*></a>"

I want to keep the inner text of the anchor and remove only the tags, but the above regex is not working.

Answer 1:

Using HtmlAgilityPack

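// Requires the HtmlAgilityPack package and "using System.Linq;".
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Collect the matching anchors first, because the tree is modified in the loop below.
var nodes = doc.DocumentNode
    .Descendants("a")
    .Where(n => n.Attributes.Any(a => a.Name == "href" && a.Value.StartsWith("/abc/def")))
    .ToArray();

foreach (var node in nodes)
{
    // keepGrandChildren = true: the <a> element is removed but its inner content stays.
    node.ParentNode.RemoveChild(node, true);
}

var newHtml = doc.DocumentNode.InnerHtml;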


Answer 2:

I'd use HtmlAgilityPack for this task. Its own description reads:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

The task itself is quite simple: select the nodes with XPath, remove them, and then read back the resulting HTML:

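var doc = new HtmlDocument();
doc.LoadHtml(xml);

// SelectNodes returns null when nothing matches, so guard before iterating.
var anchors = doc.DocumentNode.SelectNodes("//a[starts-with(@href, '/abc/def')]");
if (anchors != null)
{
    // Remove() drops the whole element, inner text included; to keep the text,
    // use anchor.ParentNode.RemoveChild(anchor, true) instead.
    foreach (var anchor in anchors.ToList())
        anchor.Remove();
}

var result = doc.DocumentNode.OuterHtml;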

This will get you exactly what you want.

EDIT:

If you want to remove only the href attribute, change the line anchor.Remove() to anchor.Attributes["href"].Remove();



Answer 3:

If the HTML is well-formed XML (which at a glance it looks like), you can load the text of the CDATA node into a new XML document, modify the XML as appropriate, and then replace the text of the original CDATA node with the XML text of your modified document.

CDATA is by definition not parsed as part of the original XML document, which is why you need a secondary one.
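A minimal sketch of that approach with System.Xml.Linq, assuming the CDATA text really is well-formed XML. The class and method names (CDataAnchorStripper, StripAnchors) and the temporary <root> wrapper are illustrative assumptions, not part of the original answer; the wrapper is there because the CDATA text holds several top-level elements.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CDataAnchorStripper
{
    // Hypothetical helper: unwraps every <a> whose href starts with hrefPrefix,
    // keeping the anchor's inner content in place.
    public static string StripAnchors(string cdataText, string hrefPrefix)
    {
        // Wrap the fragment in a temporary root so it parses as one document.
        XElement root = XElement.Parse(
            "<root>" + cdataText + "</root>", LoadOptions.PreserveWhitespace);

        // Materialize the matches first, because the tree is modified in the loop.
        var anchors = root.Descendants()
            .Where(e => e.Name.LocalName == "a"
                     && ((string)e.Attribute("href") ?? "")
                            .StartsWith(hrefPrefix, StringComparison.Ordinal))
            .ToList();

        foreach (XElement a in anchors)
        {
            // Replace the element with its own child nodes (the inner text).
            a.ReplaceWith(a.Nodes());
        }

        // Serialize the children of the temporary root, dropping the wrapper.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

The returned string can then be written back into the original CDATA node, for example by assigning it to the Value property of an XCData node. Note that the serializer may normalize some markup (an empty element such as <br/> can come back as <br />), so the output is equivalent XML rather than a byte-for-byte copy.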



Answer 4:

Note: I'm not recommending running this regex on the entire XML string, since most agree that is bad practice. The regular expression below can and should be run on the individual nodes of the document during proper traversal. The solution was posted as a single regex replacement on the entire xmlString because that is what the user requested, and they were having trouble adapting the regular expression to their particular situation. I wrote the code character by character to match how they intended to use it as closely as possible.


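To strip all anchor tags whose URL starts with /abc/def/, you're better off using a regular expression:

// In a verbatim string a double quote is written as "". The lazy (.*?) and
// [^>]* keep each match inside a single <a>...</a> pair.
result = Regex.Replace(xmlString, @"<a href=""/abc/def/[^>]*>(.*?)</a>", "$1");

Follow-up to the comments below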

According to MSDN:

Within a specified input string, replaces all strings that match a specified regular expression with a specified replacement string.

This replacement will happen on all instances, not just the first one. If the rest aren't working, it's because there's something different about them that doesn't match the regular expression.

For instance, if there are extra spaces between the a and the href in some cases, or the target attribute is specified before the href attribute, you would need to use a somewhat less specific replacement:

result = Regex.Replace(str, @"<a[^>]*href=""/OST/OSTdisplay/[^>]*>(.*?)</a>", "$1");
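One further way replacements can go wrong is greedy matching: with .* a single match can stretch from the first matching <a to the last </a> on the same line and swallow everything between them. The sketch below is an illustration rather than the answerer's code; it uses the /abc/def/ prefix from the question, and the names AnchorRegexDemo and Unwrap are hypothetical. It relies on [^>]* and a lazy (.*?) so that each match stays within one anchor, regardless of attribute order.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorRegexDemo
{
    // [^>]* cannot cross the end of the opening tag, and (.*?) stops at the
    // first closing </a>, so every anchor is matched on its own.
    private static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*\bhref=""/abc/def/[^""]*""[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: replaces each matching anchor with its inner content.
    public static string Unwrap(string fragment)
    {
        return Anchor.Replace(fragment, "$1");
    }
}
```

As the note at the top of this answer says, this is best applied to the text of an individual node rather than to the whole XML string.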