C# HtmlDecode Specific tags only

Posted 2020-03-31 07:35

Question:

I have a large HTML-encoded string and I want to decode only specific whitelisted HTML tags.

Is there a way to do this in C#? WebUtility.HtmlDecode() decodes everything.

I am looking for an implementation of DecodeSpecificTags() that will pass the test below.

    [Test]
    public void DecodeSpecificTags_SimpleInput_True()
    {
        string input = "&lt;span&gt;i am &lt;strong color=blue&gt;very&lt;/strong&gt; big &lt;br&gt;man.&lt;/span&gt;";
        string output = "&lt;span&gt;i am <strong color=blue>very</strong> big <br>man.&lt;/span&gt;";
        List<string> whiteList = new List<string>() { "strong", "br" };

        Assert.IsTrue(DecodeSpecificTags(whiteList, input) == output);
    }

Answer 1:

You could do something like this:

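// Requires: using System.Collections.Generic; using System.Net; using System.Text.RegularExpressions;
public string DecodeSpecificTags(List<string> whiteListedTagNames, string encodedInput)
{
    foreach (string s in whiteListedTagNames)
    {
        // Match an encoded opening or closing tag with this name,
        // e.g. &lt;strong color=blue&gt; or &lt;/strong&gt;.
        // \b keeps "br" from matching longer names that merely start with "br".
        string regex = @"&lt;\s*/?\s*" + Regex.Escape(s) + @"\b.*?&gt;";
        // Decode only the matched tag; everything else stays encoded.
        encodedInput = Regex.Replace(encodedInput, regex,
            m => WebUtility.HtmlDecode(m.Value), RegexOptions.IgnoreCase);
    }
    return encodedInput;
}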
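
The per-tag loop above can also be collapsed into a single pass by joining the whitelisted names into one alternation. The following is only a sketch under that assumption: the class name TagDecoder and the Main method are hypothetical and exist just to make the example runnable, and it relies solely on Regex and WebUtility from the base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class TagDecoder // hypothetical container class for the sketch
{
    // Decodes only those encoded tags whose name appears in the whitelist.
    public static string DecodeSpecificTags(IEnumerable<string> whiteList, string encodedInput)
    {
        // Build one alternation such as strong|br, escaping each name.
        string names = string.Join("|", whiteList.Select(Regex.Escape));
        if (names.Length == 0)
        {
            // An empty alternation would match every tag, so bail out early.
            return encodedInput;
        }

        string pattern = @"&lt;\s*/?\s*(?:" + names + @")\b.*?&gt;";
        return Regex.Replace(encodedInput, pattern,
            m => WebUtility.HtmlDecode(m.Value), RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "&lt;span&gt;i am &lt;strong color=blue&gt;very&lt;/strong&gt; big &lt;br&gt;man.&lt;/span&gt;";
        var whiteList = new List<string> { "strong", "br" };

        // Prints: &lt;span&gt;i am <strong color=blue>very</strong> big <br>man.&lt;/span&gt;
        Console.WriteLine(DecodeSpecificTags(whiteList, input));
    }
}
```

Like any regex-based approach, this assumes an encoded tag ends at the first &gt; after its name, so an attribute value that itself contains &gt; would cut the match short. A parser-based approach avoids that limitation.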


Answer 2:

A better approach could be to use an HTML parser such as HtmlAgilityPack, CsQuery, or NSoup to find the specific elements and decode them in a loop.

See this for links and examples of parsers.

Here is how I did it using CsQuery:

string input = "&lt;span&gt;i am &lt;strong color=blue&gt;very&lt;/strong&gt; big &lt;br&gt;man.&lt;/span&gt;";
string output = "&lt;span&gt;i am <strong color=blue>very</strong> big <br>man.&lt;/span&gt;";

var decoded = HttpUtility.HtmlDecode(output);
var encoded =input ; //  HttpUtility.HtmlEncode(decoded);

Console.WriteLine(encoded);
Console.WriteLine(decoded);

var doc=CsQuery.CQ.CreateDocument(decoded);

var paras=doc.Select("strong").Union(doc.Select ("br")) ;

var tags=new List<KeyValuePair<string, string>>();
var counter=0;

foreach (var element in paras)
{
    // Dump() is a LINQPad extension method; use Console.WriteLine outside LINQPad.
    HttpUtility.HtmlEncode(element.OuterHTML).Dump();
    var key ="---" + counter + "---";
    var value= HttpUtility.HtmlDecode(element.OuterHTML);
    var pair= new KeyValuePair<String,String>(key,value);

    element.OuterHTML = key ;
    tags.Add(pair);
    counter++;
}

var finalstring= HttpUtility.HtmlEncode(doc.Document.Body.InnerHTML);
finalstring.Dump();

foreach (var element in tags)
{
    finalstring = finalstring.Replace(element.Key, element.Value);
}

Console.WriteLine(finalstring);


Answer 3:

Or you could use HtmlAgilityPack with a blacklist or a whitelist, depending on your requirements. I'm using the blacklist approach. My blacklisted tags are stored in a text file, for example "script|img".

// Extension method: must be declared inside a static class.
public static string DecodeSpecificTags(this string content, List<string> blackListedTags)
{
    if (string.IsNullOrEmpty(content))
    {
        return content;
    }
    blackListedTags = blackListedTags.Select(t => t.ToLowerInvariant()).ToList();
    // Decode everything first, then re-encode only the blacklisted elements.
    var decodedContent = HttpUtility.HtmlDecode(content);
    var document = new HtmlDocument();
    document.LoadHtml(decodedContent);
    decodedContent = blackListedTags.Select(blackListedTag => document.DocumentNode.Descendants(blackListedTag))
        .Aggregate(decodedContent,
            (current1, nodes) =>
                nodes.Select(htmlNode => htmlNode.WriteTo())
                    .Aggregate(current1,
                        (current, nodeContent) =>
                            current.Replace(nodeContent, HttpUtility.HtmlEncode(nodeContent))));
    return decodedContent;
}


Tags: c# asp.net regex