I have the following CDATA inside an XML document:
<![CDATA[ <p xmlns="">Refer to the below: <br/>
</p>
<table xmlns:abc="http://google.com pic.xsd" cellspacing="1" class="c" type="custom" width="100%">
<tbody>
<tr xmlns="">
<th style="text-align: left">Basic offers...</th>
</tr>
<tr xmlns="">
<td style="text-align: left">Faster network</td>
<td style="text-align: left">
<ul>
<li>Session</li>
</ul>
</td>
</tr>
<tr xmlns="">
<td style="text-align: left">capabilities</td>
<td style="text-align: left">
<ul>
<li>Navigation,</li>
<li>message, and</li>
<li>contacts</li>
</ul>
</td>
</tr>
<tr xmlns="">
<td style="text-align: left">Data</td>
<td style="text-align: left">
<p>Here visit google for more info <a href="http://www.google.com" target="_blank"><font color="#0033cc">www.google.com</font></a>.</p>
<p>Remove this href tag <a href="/abc/def/{T}/t/1" target="_blank">Information</a> remove the tag.</p>
</td>
</tr>
</tbody>
</table>
<p xmlns=""><br/>
</p>
]]>
I want to somehow scan for href="/abc/def and remove every a tag whose href starts with /abc/def. In the example above, that means removing the tag and leaving just the "Information" text that was inside it. The CDATA can contain more than one a tag with "/abc/def... in its href. I am using C# for this application. Can someone please tell me how this can be done? Should I use a regex, or is there a way to do it with XML itself?
This is the regex I am trying:
"<a href=\"/abc/def/.*></a>"
I want to keep the inner text of the a tag and remove only the tags themselves, but the above regex is not working.
Note: I'm not recommending running this regex on the entire XML string, since most agree this is bad. The following regular expression can and should be run on the individual nodes of the document during proper traversal. The solution was posted as a single regex replacement on the entire xmlString because that is what the user requested, and they were having trouble adapting the regular expression to their particular situation. I wrote the code character by character to match how they intended to use it as closely as possible.
To strip all href tags where the URL starts with /abc/def/, you're better off using a regular expression.
Followup to comments below:
According to MSDN, Regex.Replace replaces every match in the input string, so this replacement will happen on all instances, not just the first one. If the rest aren't working, it's because something about them differs from the first and doesn't match the regular expression.
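The original snippet for this answer is not shown here, so the following is only a minimal sketch of the kind of replacement being described. The pattern, the class name StripAnchors, and the sample input are assumptions for illustration. Compared with the regex in the question, it stops the href value at the closing quote, captures the inner content lazily, and puts that content back with $1.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripAnchors
{
    // Assumed pattern: <a href="/abc/def/..." ...>inner</a>, capturing "inner".
    // [^"]* keeps the match inside the href value; (.*?) is lazy so each
    // match ends at the nearest </a> instead of the last one in the string.
    private const string Pattern = "<a href=\"/abc/def/[^\"]*\"[^>]*>(.*?)</a>";

    public static string Strip(string html)
    {
        // Regex.Replace replaces every match; "$1" re-inserts the captured inner content.
        return Regex.Replace(html, Pattern, "$1", RegexOptions.Singleline);
    }

    public static void Main()
    {
        string input =
            "<p>A <a href=\"/abc/def/{T}/t/1\" target=\"_blank\">Information</a> and " +
            "<a href=\"/abc/def/x\">More</a>, keep <a href=\"http://www.google.com\">google</a>.</p>";

        Console.WriteLine(Strip(input));
    }
}
```

Both /abc/def anchors are unwrapped in one call, while the google.com link is left alone because its href does not start with /abc/def/.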
For instance, if some cases have extra spaces between the a and the href, or the target attribute is specified before the href attribute, you would need to use a somewhat less specific replacement:
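A less specific pattern along those lines might look like the sketch below. The exact pattern is an assumption, not the one originally posted: it allows any whitespace after the a, any attributes before href, and optional spaces around the equals sign. The class name LenientStripAnchors is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LenientStripAnchors
{
    // Assumed pattern:
    //   <a\s+        the tag name followed by at least one whitespace character
    //   [^>]*?       any attributes that come before href (e.g. target)
    //   href\s*=\s*  href with optional spaces around '='
    //   (.*?)        the inner content, captured lazily
    private const string Pattern =
        "<a\\s+[^>]*?href\\s*=\\s*\"/abc/def/[^\"]*\"[^>]*>(.*?)</a>";

    public static string Strip(string html)
    {
        return Regex.Replace(
            html, Pattern, "$1",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input =
            "<a   target=\"_blank\" href=\"/abc/def/1\">One</a> " +
            "<A HREF = \"/abc/def/2\">Two</A> " +
            "<a href=\"http://www.google.com\">google</a>";

        Console.WriteLine(Strip(input));
    }
}
```

The looser the pattern, the more likely it is to catch something unintended (for example an attribute whose name merely ends in href), which is one more reason to run it on individual nodes rather than on the whole document.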
If the HTML is well-formed XML (which at a glance it looks like), you can load the text of the CDATA node into a new XML document, modify that XML as appropriate, and then replace the text of the original CDATA node with the XML text of your modified document.
Because CDATA is by definition not parsed as part of the original XML document, you need a secondary document to work on its contents.
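The steps above could be sketched with LINQ to XML as follows. This is an illustration under assumptions, not code from the answer: the wrapper element name, the CDataAnchorStripper class, and the sample host document are all hypothetical. The wrapper is needed because the CDATA text in the question has several top-level elements, which a parser will not accept as a single document.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CDataAnchorStripper
{
    public static string StripAnchors(string fragment)
    {
        // Wrap the fragment in a temporary root ("wrapper" is a made-up name)
        // so that multiple top-level elements parse as one document.
        XElement root = XElement.Parse(
            "<wrapper>" + fragment + "</wrapper>",
            LoadOptions.PreserveWhitespace);

        // Snapshot the matching anchors first, because the tree is modified below.
        var anchors = root.Descendants("a")
            .Where(a => ((string)a.Attribute("href") ?? "")
                .StartsWith("/abc/def", StringComparison.Ordinal))
            .ToList();

        // Replace each anchor with its own child nodes, which keeps the inner text.
        foreach (XElement a in anchors)
        {
            a.ReplaceWith(a.Nodes());
        }

        // Serialize only the children, dropping the temporary wrapper again.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        // Hypothetical host document containing one CDATA section.
        XDocument doc = XDocument.Parse(
            "<item><body><![CDATA[<p>See <a href=\"/abc/def/1\">Information</a>.</p>]]></body></item>");

        foreach (XCData cdata in doc.DescendantNodes().OfType<XCData>().ToList())
        {
            cdata.Value = StripAnchors(cdata.Value);
        }

        Console.WriteLine(doc.ToString(SaveOptions.DisableFormatting));
    }
}
```

This only works while the CDATA content really is well-formed XML; an unclosed tag or a bare ampersand will make XElement.Parse throw, in which case an HTML parser is the safer choice.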
I'd use HtmlAgilityPack for this task. The task itself is quite simple: select the nodes using XPath, then remove them. All that's left is to get the resulting HTML:
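The answer's own snippet is not shown here; a minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced, might look like this. The XPath expression, the AgilityStripper class, and the sample input are assumptions for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

public static class AgilityStripper
{
    public static string Strip(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumed XPath: every <a> whose href starts with /abc/def.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[starts-with(@href, '/abc/def')]");
        if (anchors != null)
        {
            foreach (HtmlNode anchor in anchors)
            {
                // Removes the whole element, including its inner text.
                // To keep the inner text instead, HtmlAgilityPack offers
                // anchor.ParentNode.RemoveChild(anchor, true), which keeps the children.
                anchor.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        string input =
            "<p>Remove this <a href=\"/abc/def/{T}/t/1\" target=\"_blank\">Information</a> here.</p>";

        Console.WriteLine(Strip(input));
    }
}
```

Because HtmlAgilityPack is an HTML parser, it tolerates markup that is not well-formed XML, which the XML-based approaches do not.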
This will get you exactly what you want.
EDIT:
If you want to remove the href attribute only, change this line
anchor.Remove()
to this one
anchor.Attributes["href"].Remove();
Using HtmlAgilityPack