Platform: ASP.NET 4.0 MVC 4 C# jQuery
Here's what I want to do.
I'm building a simple forum for my product. I want to give users a text area to enter their posts or comments.
- I'd like to allow basic text formatting HTML and links - like p, a, b, i
- Don't want any other html styling - i.e. div, span, etc. etc.
- Don't want any scripting access
Is there a clever way to do this? I could, for example, allow unsafe text and examine it on the server side, but I doubt I'd be able to clean it up correctly, and I might open security holes.
Preferably want to avoid heavy duty plugins.
Thanks!
(PS - my worst fallback is that I allow safe text only, i.e. keep the ASP.NET security on, and then use a special markup for links - like [link] [b] [i])
No matter what approach you use, you need to assume everything entered into the field is malicious, i.e. don't trust any data.
I wouldn't bother too much with any client validation in JavaScript/jQuery. It'll be complex and will only need to be redone server side anyway.
Server side you want to take a whitelist approach, i.e. if it's not on the list, it's invalid. You can't rely on an XML processor because the user's text may not be well-formed XML; instead you'd probably want to use a regular expression.
I would define a set of tags that are valid (you've said p, a, b and i, but I would be wary of the last two as you'd almost never get them in 'wild' HTML). I would then define which attributes, if any, are valid for these tags. I'm guessing you'd want at the very least an href on the a.
You could strip any tag that doesn't match. My regex skills aren't great, but this appears to find all the tags you want to keep; it needs to be inverted to find the ones to remove.
\<a\shref\=".[^\"]*\"\>|\</?[abip]\s?\>
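One way to "invert" the pattern above without writing a negated regex is to match every tag and keep only those that pass the whitelist. The following is a minimal sketch under a few assumptions that are not in the answer: the class name TagWhitelist and method Strip are hypothetical, and href values are restricted to http/https so that javascript: links are dropped. It is not a complete sanitizer: orphaned closing tags and a stray `<` with no closing `>` are left in place, so the output still deserves care when rendered.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes every tag that is not on the whitelist,
// keeping the text between tags.
public static class TagWhitelist
{
    // Allowed forms only: <a href="http(s)://...">, and bare <a>, <b>, <i>, <p>
    // opening or closing tags. Restricting href to http/https is an assumption
    // added here; it is not part of the original pattern.
    private static readonly Regex Allowed = new Regex(
        @"^(<a\shref=""https?://[^""<>]*"">|</?[abip]\s?>)$",
        RegexOptions.IgnoreCase);

    // Anything that looks like a tag: '<', then any characters up to the next '>'.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>");

    public static string Strip(string input)
    {
        // Keep a tag only when it matches the whitelist exactly; otherwise drop it.
        return AnyTag.Replace(input ?? "", m => Allowed.IsMatch(m.Value) ? m.Value : "");
    }
}
```

For instance, `TagWhitelist.Strip("<p>Hi <b>there</b> <script>alert(1)</script></p>")` keeps the p and b tags and removes the script tags, leaving their inner text behind as plain text.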
In .NET 4.5+, or by adding System.Web.Security.AntiXss to an older version of .NET, there is a good way to address this issue. We can use [AllowHtml] and a custom annotation attribute together. The approach should whitelist the HTML tags inside the string and validate the request.
Here is the custom annotation attribute for this job:
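    using System;
    using System.ComponentModel.DataAnnotations;
    using System.Text.RegularExpressions;

    [AttributeUsage(AttributeTargets.Property | AttributeTargets.Field, Inherited = true, AllowMultiple = false)]
    public sealed class RemoveScriptAttribute : ValidationAttribute
    {
        // Matches every tag except opening and closing a, b, i and p tags.
        // Note: attributes on the allowed tags (e.g. onclick, href="javascript:...") are NOT removed.
        public const string DefaultRegexPattern = @"\<((?=(?!\b(a|b|i|p)\b))(?=(?!\/\b(a|b|i|p)\b))).*?\>";

        public string RegexPattern { get; }

        public RemoveScriptAttribute(string regexPattern = null)
        {
            RegexPattern = regexPattern ?? DefaultRegexPattern;
        }

        protected override ValidationResult IsValid(object value, ValidationContext ctx)
        {
            var valueStr = value as string;
            if (valueStr != null)
            {
                // Singleline lets '.' cross newlines, so a tag split over several lines is matched too.
                // The 250 ms timeout guards against pathological input.
                var newVal = Regex.Replace(valueStr, RegexPattern, "",
                    RegexOptions.IgnoreCase | RegexOptions.Singleline, TimeSpan.FromMilliseconds(250));
                if (newVal != valueStr)
                {
                    // Write the cleaned text back to the model property.
                    var prop = ctx.ObjectType.GetProperty(ctx.MemberName);
                    prop.SetValue(ctx.ObjectInstance, newVal);
                }
            }
            // The attribute only cleans the value; it never fails validation.
            return ValidationResult.Success;
        }
    }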
Then decorate the model property that should accept HTML with both the [AllowHtml] and [RemoveScript] attributes, like this:
    public class MyModel
    {
        [AllowHtml, RemoveScript]
        public string StringProperty { get; set; }
    }
This will allow only <a>, <b>, <i>, and <p> HTML tags to get through. All other tags will be removed; however, it is smart enough to keep the inner text of the tags. E.g. if you send:
"This is a <b>rich text</b> entered by <u>John Smith</u>."
you will end up getting this:
"This is a <b>rich text</b> entered by John Smith."
It is also easy to whitelist more HTML tags. E.g. if you want to accept <u></u>, <br />, and <hr />, change the DefaultRegexPattern (affects globally) or pass a modified regexPattern to an instance of RemoveScriptAttribute, like this:
    [AllowHtml]
    [RemoveScript(regexPattern: @"\<((?=(?!\b(a|b|i|p|u|br|hr)\b))(?=(?!\/\b(a|b|i|p|u)\b))).*?\>")]
    public string Body { get; set; }
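Before relying on the pattern, it may help to check what it actually removes. The sketch below applies the same DefaultRegexPattern to a string outside of MVC; the class name PatternCheck and method Apply are hypothetical names used only for this illustration. One thing it makes visible: the pattern decides by tag name only, so attributes on an allowed tag (for example an onmouseover handler on a <b>) pass through unchanged and would need separate handling.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for trying the pattern in isolation.
public static class PatternCheck
{
    // Same pattern as DefaultRegexPattern in RemoveScriptAttribute.
    public const string Pattern =
        @"\<((?=(?!\b(a|b|i|p)\b))(?=(?!\/\b(a|b|i|p)\b))).*?\>";

    public static string Apply(string input)
    {
        // Remove every tag whose name is not a, b, i or p; inner text is kept.
        return Regex.Replace(input, Pattern, "",
            RegexOptions.IgnoreCase, TimeSpan.FromMilliseconds(250));
    }
}
```

Calling `PatternCheck.Apply("<u>John</u> <script>alert(1)</script>")` strips all four tags and leaves the text, while `PatternCheck.Apply("<b onmouseover=\"alert(1)\">hi</b>")` returns its input untouched, event handler included.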
There are tons of online editors available for you to use. I typed "online text editor free" into Google and got a bunch of editors to review.
If you must accept HTML in the submitted text, then you will want to parse it and reject the submission when you find tags that aren't "safe".
FYI this might be of interest to you
https://meta.stackexchange.com/questions/121981/stackoverflow-official-wmd-editor
I marked joocer's answer as the accepted answer because it helped me form my opinion (though what he suggested isn't what I did in the end).
I decided on a simple rule: I would linkify http://... links and disallow any other HTML (which is fine for my application). That way, I let the ASP.NET framework do all the error checking and reject any HTML markup. Then, when I rendered the text on the client, I only recognized the http:// links and wrapped them in link markup, while HTML-encoding everything else.
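The post describes doing this at render time on the client and does not show the code. As an illustration of the same rule, here is a minimal server-side sketch in C#; the class name Linkifier, the method ToSafeHtml, and the URL pattern are all assumptions made for this example. Every piece of text is HTML-encoded, and only substrings that look like http/https URLs are wrapped in an anchor.

```csharp
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: encodes all text and turns http(s) URLs into links.
public static class Linkifier
{
    // A deliberately simple URL pattern (an assumption): the scheme, then
    // everything up to whitespace, a quote or an angle bracket.
    private static readonly Regex Url = new Regex(
        @"https?://[^\s<>""']+", RegexOptions.IgnoreCase);

    public static string ToSafeHtml(string text)
    {
        text = text ?? "";
        var sb = new StringBuilder();
        int last = 0;

        foreach (Match m in Url.Matches(text))
        {
            // Encode the plain text that precedes this URL.
            sb.Append(WebUtility.HtmlEncode(text.Substring(last, m.Index - last)));

            // Encode the URL itself before using it as attribute value and link text.
            string url = WebUtility.HtmlEncode(m.Value);
            sb.Append("<a href=\"").Append(url).Append("\">").Append(url).Append("</a>");

            last = m.Index + m.Length;
        }

        // Encode whatever follows the last URL.
        sb.Append(WebUtility.HtmlEncode(text.Substring(last)));
        return sb.ToString();
    }
}
```

Because the pattern is simple, trailing punctuation directly after a URL (such as a full stop at the end of a sentence) ends up inside the link; a stricter pattern would be needed if that matters.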