Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.
If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace every match of a tag pattern (a <, followed by any characters other than >, up to the closing >) with the empty string, globally. Don't forget to normalize the string afterwards, replacing each run of whitespace with a single space, and trimming the result. Optionally convert any HTML character entities back to the actual characters.
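The steps above can be sketched in C# roughly as follows. This is a minimal sketch, not the answer's original code: the class name HtmlStripper, the method name StripTags, and both regex patterns are assumptions chosen for illustration. Entity decoding uses System.Net.WebUtility.HtmlDecode.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Assumed tag pattern: "<", any run of characters other than ">",
    // then either ">" or the end of the input (an unterminated tag).
    private static readonly Regex TagPattern = new Regex(@"<[^>]*(>|$)");

    // Any run of whitespace, including line breaks.
    private static readonly Regex WhitespacePattern = new Regex(@"\s+");

    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // 1. Replace every tag with the empty string, globally.
        string text = TagPattern.Replace(html, string.Empty);

        // 2. Collapse whitespace runs to a single space and trim.
        text = WhitespacePattern.Replace(text, " ").Trim();

        // 3. Optionally turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(text);
    }
}
```

With the example from the question, HtmlStripper.StripTags("<ul><li>Hello</li></ul>") returns "Hello".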
Note: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values. Use a proper parser if you must get it right under all circumstances.
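To make the limitation concrete, here is a small sketch of what happens when an attribute value contains >. The class name, method name, and tag pattern are again assumptions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeLimitationDemo
{
    // Naive tag removal with an assumed pattern: the match for a tag
    // ends at the FIRST ">" it sees, even one inside a quoted attribute.
    public static string NaiveStrip(string html)
    {
        return Regex.Replace(html, @"<[^>]*(>|$)", string.Empty);
    }
}
```

For the input <a title="a > b">link</a>, the first match stops at the > inside the title attribute, so NaiveStrip returns the text  b">link (with a leading space) instead of link.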
For the second parameter of strip_tags, i.e. keeping some tags, you may need some code that uses HtmlAgilityPack to parse the markup and drop every element that is not on your allowed list.
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/
For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. It can fail even on well-formatted HTML, though, because valid HTML is not necessarily well-formed XML, so always add a catch with a regex as a backup. Note this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.
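A rough sketch of that idea, assuming the input is a fragment that gets wrapped in a dummy root element before parsing. The names XmlTextExtractor and StripTags, the <root> wrapper, and the fallback pattern are all hypothetical choices, not the original poster's code.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public static class XmlTextExtractor
{
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        try
        {
            var sb = new StringBuilder();
            // Wrap the fragment so several top-level elements still
            // form a single XML document.
            using (var reader = XmlReader.Create(new StringReader("<root>" + html + "</root>")))
            {
                while (reader.Read())
                {
                    // Keep only character data; element nodes are skipped.
                    if (reader.NodeType == XmlNodeType.Text ||
                        reader.NodeType == XmlNodeType.CDATA ||
                        reader.NodeType == XmlNodeType.Whitespace ||
                        reader.NodeType == XmlNodeType.SignificantWhitespace)
                    {
                        sb.Append(reader.Value);
                    }
                }
            }
            return sb.ToString().Trim();
        }
        catch (XmlException)
        {
            // Input was not well-formed XML (e.g. an unclosed <br>):
            // fall back to an assumed regex-based tag removal.
            return Regex.Replace(html, @"<[^>]*(>|$)", string.Empty).Trim();
        }
    }
}
```

Setting a breakpoint inside the while loop lets you step through each node the reader produces. Markup such as <p>Hello<br>World</p> is not well-formed XML, so it takes the fallback path.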