I am using MVC 3 and Razor View engine.
What I am trying to do
I am making a blog using MVC 3, I want to remove all HTML formatting tags like <p> <b> <i>
etc..
For which I am using the following code. (it does work)
@{
post.PostContent = post.PostContent.Replace("<p>", " ");
post.PostContent = post.PostContent.Replace("</p>", " ");
post.PostContent = post.PostContent.Replace("<b>", " ");
post.PostContent = post.PostContent.Replace("</b>", " ");
post.PostContent = post.PostContent.Replace("<i>", " ");
post.PostContent = post.PostContent.Replace("</i>", " ");
}
I feel that there definitely has to be a better way to do this. Can anyone please guide me on this.
Thanks Alex Yaroshevich,
Here is what I use now..
post.PostContent = Regex.Replace(post.PostContent, @"<[^>]*>", String.Empty);
The regular expression is slow. use this, it's faster:
public static string StripHtmlTagByCharArray(string htmlString)
{
char[] array = new char[htmlString.Length];
int arrayIndex = 0;
bool inside = false;
for (int i = 0; i < htmlString.Length; i++)
{
char let = htmlString[i];
if (let == '<')
{
inside = true;
continue;
}
if (let == '>')
{
inside = false;
continue;
}
if (!inside)
{
array[arrayIndex] = let;
arrayIndex++;
}
}
return new string(array, 0, arrayIndex);
}
You can take a look at http://www.dotnetperls.com/remove-html-tags
You can use regular expression.
This article might help you.
Just in case you want to use regex in .NET to strip the HTML tags, the following seems to work pretty well on the source code for this very page. It's better than some of the other answers on this page because it looks for actual HTML tags instead of blindly removing everything between <
and >
. Back in the BBS days, we typed <grin>
a lot instead of :)
, so removing <grin>
is not an option. :)
This solution only removes the tags. It does not remove the contents of those tags in situations where that might be important -- a script tag, for example. You'd see the script, but the script wouldn't execute because the script tag itself gets removed. Removing the contents of an HTML tag is VERY tricky, and practically requires that the HTML fragment be well formed...
Also note the RegexOption.Singleline
option. That's very important for any block of HTML. as there's nothing wrong with opening an HTML tag on one line and closing it in another.
string strRegex = @"</{0,1}(!DOCTYPE|a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|big|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6|head|header|hr|html|i|iframe|img|input|ins|kbd|keygen|label|legend|li|link|main|map|mark|menu|menuitem|meta|meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video|wbr){1}(\s*/{0,1}>|\s+.*?/{0,1}>)";
Regex myRegex = new Regex(strRegex, RegexOptions.Singleline);
string strTargetString = @"<p>Hello, World</p>";
string strReplace = @"";
return myRegex.Replace(strTargetString, strReplace);
I'm not saying this is the best answer. It's just an option and it worked great for me.