Question:
I'm trying to code a secure and lightweight white-list based HTML purifier which will use DOMDocument. In order to avoid unnecessary complexity I am willing to make the following compromises:
- HTML comments are removed
- script and style tags are stripped altogether
- only the child nodes of the body tag will be returned
- all HTML attributes that can trigger Javascript events will either be validated or removed
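The first three compromises could be sketched roughly as follows. This is only an illustration, not the implementation the question refers to: the function name purifyBasic is hypothetical, the input is assumed to be a non-empty HTML fragment, and DOMDocument::loadHTML's default character-set handling is left untouched.

```php
<?php
// Hypothetical helper: drops comments plus script/style elements and returns
// the serialized child nodes of <body>. Assumes $html is a non-empty string.
function purifyBasic(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Remove every comment node and every script/style element with its content.
    $unwanted = $xpath->query('//comment() | //script | //style');
    foreach (iterator_to_array($unwanted, false) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize only the children of <body>.
    $output = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $output .= $doc->saveHTML($child);
        }
    }

    return $output;
}
```

Attribute validation (the fourth rule) is deliberately left out of this sketch.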
I've been reading a lot about XSS attacks and prevention and I hope I'm not being too naive (if I am, please let me know!) in assuming that if I follow all the rules I mentioned above, I will be safe from XSS.
The problem is I am not sure what other tags and attributes (in any [X]HTML version and/or browser versions/implementations) can trigger Javascript events, besides the default Javascript event attributes:
onAbort
onBlur
onChange
onClick
onDblClick
onDragDrop
onError
onFocus
onKeyDown
onKeyPress
onKeyUp
onLoad
onMouseDown
onMouseMove
onMouseOut
onMouseOver
onMouseUp
onMove
onReset
onResize
onSelect
onSubmit
onUnload
Are there any other non-default or proprietary event attributes that can trigger Javascript (or VBScript, etc...) events or code execution? I can think of href, style and action, for instance:
<a href="javascript:alert(document.location);">XSS</a> // or
<b style="width: expression(alert(document.location));">XSS</b> // or
<form action="javascript:alert(document.location);"><input type="submit" /></form>
I will probably just remove any style attributes in the HTML tags. The action and href attributes pose a bigger challenge, but I think the following code is enough to make sure their value is either a relative or absolute URL and not some nasty Javascript code:
$value = $attribute->value;
// Remove the attribute when it carries a scheme that is not on the whitelist.
if ((strpos($value, ':') !== false) && (preg_match('~^(?:(?:s?f|ht)tps?|mailto):~i', $value) == 0))
{
    $node->removeAttributeNode($attribute);
}
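To make the behaviour of that check easier to test in isolation, it can be wrapped in a small function. The name hasAllowedScheme is hypothetical; the logic is the same as the snippet above, so it is only a sketch of the question's approach and not a vetted URL validator.

```php
<?php
// Hypothetical wrapper around the check from the question.
// Returns true when the value has no colon (treated as a relative URL)
// or starts with one of the whitelisted schemes.
function hasAllowedScheme(string $value): bool
{
    // No colon anywhere: there cannot be a scheme, so treat it as relative.
    if (strpos($value, ':') === false) {
        return true;
    }

    // Anchored, case-insensitive match against http(s), (s)ftp(s) and mailto.
    return preg_match('~^(?:(?:s?f|ht)tps?|mailto):~i', $value) === 1;
}

var_dump(hasAllowedScheme('/articles/42'));           // relative URL
var_dump(hasAllowedScheme('https://example.com/'));   // whitelisted scheme
var_dump(hasAllowedScheme('javascript:alert(1)'));    // rejected
var_dump(hasAllowedScheme(" javascript:alert(1)"));   // rejected: leading space defeats the anchor
```

Because the pattern is anchored at the start, values with leading whitespace or embedded control characters in the scheme are rejected rather than accepted. The trade-off is that a harmless relative URL containing a colon (for example in a query string) is rejected as well.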
So, my two obvious questions are:
- Am I missing any tags or attributes that can trigger events?
- Is there any attack vector that is not covered by these rules?
After a lot of testing, pondering and researching I've come up with the following (rather simple) implementation which appears to be immune to any XSS attack vector I could throw at it.
I highly appreciate all your valuable answers, thanks.
Answer 1:
You mention href and action as places javascript: URLs can appear, but you're missing the src attribute among a bunch of other URL loading attributes.
Line 399 of the OWASP Java HTMLPolicyBuilder is the definition of URL attributes in a white-listing HTML sanitizer.
private static final Set<String> URL_ATTRIBUTE_NAMES = ImmutableSet.of(
    "action", "archive", "background", "cite", "classid", "codebase", "data",
    "dsync", "formaction", "href", "icon", "longdesc", "manifest", "poster",
    "profile", "src", "usemap");
The HTML5 Index contains a summary of attribute types. It doesn't mention some conditional things like <input type=URL value=...> but if you scan that list for valid URL and friends, you should get a decent idea of what HTML5 adds. The set of HTML 4 attributes with type %URI is also informative.
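Applied to the question's PHP code, the same idea might look like the sketch below. The attribute names are copied from the OWASP list quoted above; the function name stripUnsafeUrlAttributes is hypothetical, and the scheme check is the one from the question, not something taken from the OWASP sanitizer.

```php
<?php
// Attribute names copied from the OWASP list quoted in this answer.
const URL_ATTRIBUTE_NAMES = [
    'action', 'archive', 'background', 'cite', 'classid', 'codebase', 'data',
    'dsync', 'formaction', 'href', 'icon', 'longdesc', 'manifest', 'poster',
    'profile', 'src', 'usemap',
];

// Hypothetical helper: removes URL-carrying attributes whose value has a
// scheme outside the question's whitelist.
function stripUnsafeUrlAttributes(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);

    // '//@*' selects every attribute node in the document.
    foreach (iterator_to_array($xpath->query('//@*'), false) as $attribute) {
        if (!in_array(strtolower($attribute->nodeName), URL_ATTRIBUTE_NAMES, true)) {
            continue;
        }

        $value = $attribute->value;
        $hasColon = strpos($value, ':') !== false;
        $allowed = preg_match('~^(?:(?:s?f|ht)tps?|mailto):~i', $value) === 1;

        if ($hasColon && !$allowed) {
            $attribute->ownerElement->removeAttributeNode($attribute);
        }
    }
}
```

This only covers attributes whose whole value is a single URL. Conditional cases such as the value of a URL-typed input are not handled here.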
Your protocol whitelist looks very similar to the OWASP sanitizer one. The addition of ftp and sftp looks innocuous enough.
A good source of security related schema info for HTML element and attributes is the Caja JSON whitelists which are used by the Caja JS HTML sanitizer.
How are you planning on rendering the resulting DOM? If you're not careful, then even if you strip out all the <script> elements, an attacker might get a buggy renderer to produce content that a browser interprets as containing a <script> element. Consider this valid HTML, which does not contain a script element:
<textarea>&lt;/textarea&gt;&lt;script&gt;alert(1337)&lt;/script&gt;</textarea>
A buggy renderer might output the contents of this as:
<textarea></textarea><script>alert(1337)</script></textarea>
which does contain a script element.
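A renderer avoids this by re-escaping text when it writes it back out. As a minimal sketch (the function name renderTextarea is hypothetical and stands in for whatever serializer is used), PHP's htmlspecialchars does the escaping:

```php
<?php
// Hypothetical renderer helper: writes decoded text back into a textarea,
// escaping it so it cannot close the element or open a new one.
function renderTextarea(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    return '<textarea>' . $escaped . '</textarea>';
}

// The decoded text content of the textarea from the example above.
echo renderTextarea('</textarea><script>alert(1337)</script>'), "\n";
// prints: <textarea>&lt;/textarea&gt;&lt;script&gt;alert(1337)&lt;/script&gt;</textarea>
```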
(Full disclosure: I wrote chunks of both HTML sanitizers mentioned above.)
Answer 2:
Garuda has already given what I would deem as the "correct" answer, and his links are very useful, but he beat me to the punch!
I give my answer only to reinforce.
In this day and age of increasing features in the html and ecmascript specs, avoiding script injection and other such vulnerabilities in html becomes more and more difficult. With each new addition, a whole world of possible injections is introduced. This is coupled with the fact that different browsers probably have different ideas of how they are going to implement these specs, so you get even more possible vulnerabilities.
Take a look at a short list of vectors introduced by html 5
The best solution is to choose what you will allow rather than what you will deny. It is much easier to say "These tags and these attributes for those given tags alone are allowed. Everything else will be sanitized accordingly or thrown out."
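With DOMDocument, an allow-list along those lines might be sketched like this. The tag and attribute choices in ALLOWED_MARKUP are purely illustrative assumptions, the function name applyWhitelist is hypothetical, and disallowed elements are dropped together with their content, which is only one of several possible policies.

```php
<?php
// Illustrative allow-list (an assumption, not a recommendation):
// element name => attribute names permitted on that element.
const ALLOWED_MARKUP = [
    'p'  => [],
    'b'  => [],
    'i'  => [],
    'ul' => [],
    'li' => [],
    'a'  => ['href', 'title'],
];

// Hypothetical helper: removes every element inside <body> that is not
// listed, and every attribute not listed for its element.
function applyWhitelist(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);

    foreach (iterator_to_array($xpath->query('//body//*'), false) as $element) {
        $name = strtolower($element->nodeName);

        if (!array_key_exists($name, ALLOWED_MARKUP)) {
            // Unknown element: drop it and its whole subtree.
            if ($element->parentNode !== null) {
                $element->parentNode->removeChild($element);
            }
            continue;
        }

        // Copy the attribute list first, since it changes as attributes are removed.
        foreach (iterator_to_array($element->attributes, false) as $attribute) {
            if (!in_array(strtolower($attribute->nodeName), ALLOWED_MARKUP[$name], true)) {
                $element->removeAttributeNode($attribute);
            }
        }
    }
}
```

Attributes that survive the allow-list, such as href here, would still need their values validated separately.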
It would be very irresponsible for me to compile a list and say "okay, here you go: here's a list of all of the injection vectors you missed. You can sleep easy." In fact, there are probably many injection vectors that are not even known by black hats or white hats. As the ha.ckers website states, script injection is really only limited by the mind.
I'd like to answer your specific question at least a little bit, so here are some glaring omissions from your blacklist:
- img src attribute. I think it is important to note that src is a valid attribute on other elements and could be potentially harmful. img also has dynsrc and lowsrc, maybe even more.
- type and language attributes
- CDATA in addition to just html comments.
- Improperly sanitized input values. This may not be a problem depending upon how strict your html parsing is.
- Any ambiguous special characters. In my opinion, even unambiguous ones should probably be encoded.
- Missing or incorrect quotes on attributes (such as grave quotes).
- Premature closing of textarea tags.
- UTF-8 (and 7) encoded characters in scripts
- Even though you will only return child nodes of the body tag, many browsers will still evaluate head, and html elements inside of body, and most head-only elements inside of body anyway, so this probably won't help much.
- In addition to css expressions, background image expressions
- frames and iframes
- embed and probably object and applet
- Server side includes
- PHP tags
- Any other injections (SQL Injection, executable injection, etc.)
By the way, I'm sure this doesn't matter, but camelCased attributes are invalid xhtml and should be lower cased. I'm sure this doesn't affect you.
Answer 3:
You might want to check these 2 links out for additional reference:
http://adamcecc.blogspot.com/2011/01/javascript.html (this is only applicable when your 'filtered' input is ever going to find itself between script tags on a page)
http://ha.ckers.org/xss.html (which has a lot of browser-specific event triggers listed)
I've used HTML Purifier, as you are doing, for this reason too, in combination with a WYSIWYG editor. What I did differently was use a very strict whitelist with a couple of basic markup tags and attributes available, expanding it when the need arose. This keeps you from getting attacked by very obscure vectors (like the first link above) and you can dig in on each newly needed tag/attribute one by one.
Just my 2 cents..
Answer 4:
Don't forget the HTML5 JavaScript event handlers:
http://www.w3schools.com/html5/html5_ref_eventattributes.asp
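Since that list keeps growing, one conservative option is to drop every attribute whose name starts with "on" instead of enumerating handlers one by one. The sketch below assumes that policy; the function name stripEventAttributes is hypothetical, and an attribute allow-list would make this step unnecessary.

```php
<?php
// Hypothetical helper: removes every attribute whose name begins with "on",
// which covers onclick, onload and any newer event handler attributes.
function stripEventAttributes(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);

    foreach (iterator_to_array($xpath->query('//@*'), false) as $attribute) {
        // Case-insensitive prefix test on the attribute name only.
        if (stripos($attribute->nodeName, 'on') === 0) {
            $attribute->ownerElement->removeAttributeNode($attribute);
        }
    }
}
```

The prefix test looks only at attribute names, so a value such as title="online" is untouched, while any harmless attribute that happens to start with "on" is removed too.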