I'm testing one of my web applications using Acunetix. To protect this project against XSS attacks, I used HTML Purifier. This library is recommended by most PHP developers for this purpose, but my scan results show that HTML Purifier cannot protect us from XSS attacks completely. The scanner found two ways of attack by sending different harmful inputs:
1<img sRc='http://attacker-9437/log.php?
(See HTML Purifier result here)
1"onmouseover=vVF3(9185)"
(See HTML Purifier result here)
As you can see from the results, HTML Purifier could not detect these attacks. I don't know whether there is a specific HTML Purifier option that solves such problems, or whether it really is unable to detect these methods of XSS attack.
Do you have any idea? Or any other solution?
All HTML Purifier seems to be doing, from the brief look that I gave it, is HTML-encoding certain characters such as <, > and so on. However, there are other means of invoking JS without using the normal HTML characters.
Please review comments (by @pinkgothic) below.
Points below:
Using an <img> tag, point the src to some non-existent file, which in turn raises an error. That error can then be handled by the onerror handler to run some JavaScript code. Take the following example:
<img src=x onerror=alert(document.domain)>
The entry point for this is generally accompanied by prematurely closing another tag on an input. For example (URL decoded for clarity):
This, however, is easily mitigated by HTML-escaping meta-characters (i.e. <, >).
The same handler can also be injected into an existing tag when user input ends up inside one of its attributes:
<img src="$USER_DEFINED">
A normal example would be:
<img src="http://example.com/img.jpg">
However, inserting the above payload, we cut off the src attribute, which points to a non-existent file, and inject an onerror handler:
<img src="1"onerror=alert(document.domain)">
This executes the same payload mentioned above.
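The attribute breakout and the escaping fix can be sketched in a few lines of PHP. This is a minimal illustration rather than a complete defence; the variable names are my own, and the payload is the one from the example above.

```php
<?php
// Attacker-controlled value destined for the src attribute.
$payload = '1"onerror=alert(document.domain)';

// Unsafe: the double quote in the payload ends the attribute early,
// so the rest is parsed as a separate onerror attribute.
$unsafe = '<img src="' . $payload . '">';

// Safer: escape the value for an HTML attribute context first.
$escaped = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');
$safe = '<img src="' . $escaped . '">';

echo $unsafe, PHP_EOL; // <img src="1"onerror=alert(document.domain)">
echo $safe, PHP_EOL;   // <img src="1&quot;onerror=alert(document.domain)">
```

In the escaped version the quote arrives as &quot;, so the browser treats the whole string as the (broken) image URL and never sees an onerror attribute.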
Remediation
This is heavily documented and tested in multiple places, so I won't go into detail. However, the following two articles are great on the subject and will cover all your needs:
(This is a late answer since this question is becoming the place duplicate questions are linked to, and previously some vital information was only available in comments.)
HTML Purifier is a contextual HTML sanitiser, which is why it seems to be failing on those tasks.
Let's look at why in some detail:
1<img sRc='http://attacker-9437/log.php?
You'll notice that HTML Purifier closed this tag for you, leaving only an image injection. An image is a perfectly valid and safe tag (barring, of course, current image library exploits). If you want it to throw away images entirely, consider adjusting the HTML Purifier whitelist by setting HTML.Allowed.
That the image from the example is now loading a URL that belongs to an attacker, thus giving the attacker the IP of the user loading the page (and nothing else), is a tricky problem that HTML Purifier wasn't designed to solve. That said, you could write a HTML Purifier attribute checker that runs after purification, but before the HTML is put back together, like this:
The HTMLPurifier_AttrTransform_CheckURL class would need to have a structure like this:
Of course, it's difficult to do this 'right':
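One of the harder parts is deciding which image URLs to accept. As a rough sketch only, here is a host allowlist check in plain PHP. The function name is_allowed_image_url and the host names are hypothetical, not part of HTML Purifier, and a real policy would likely need more than this.

```php
<?php
/**
 * Hypothetical helper: returns true only for http(s) URLs whose host
 * is on the given allowlist. Anything unparsable is rejected.
 */
function is_allowed_image_url(string $url, array $allowedHosts): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false; // relative, malformed, or host-less URL
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false; // rejects javascript:, data:, and similar schemes
    }
    return in_array(strtolower($parts['host']), $allowedHosts, true);
}

// Example allowlist; the host names are placeholders.
$allowed = ['example.com', 'cdn.example.com'];

var_dump(is_allowed_image_url('http://example.com/img.jpg', $allowed));    // bool(true)
var_dump(is_allowed_image_url('http://attacker-9437/log.php?', $allowed)); // bool(false)
```

An attribute checker could run a test like this against the src value and drop the attribute when it returns false. Note that this sketch also rejects relative URLs, which may or may not be what you want.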
1"onmouseover=vVF3(9185)"
HTML Purifier assumes the context your HTML is set in is a <div> (unless you tell it otherwise by setting HTML.Parent).
If you just feed it an attribute value, it's going to assume you're going to output this somewhere so the end result looks like this:
<div>1"onmouseover=vVF3(9185)"</div>
That's why it appears not to be doing anything about this input: it's harmless in this context. You might not even want to strip this information in that context. I mean, we're talking about this snippet here on Stack Overflow, and that's valuable (and not causing a security problem).
Context matters. Now, if you instead feed HTML Purifier this snippet:
...suddenly you can see what it's made to do:
Now it's removed the injection, because in this context, it would have been malicious.
What to use HTML Purifier for and what not
So now you're left to wonder what you should be using HTML Purifier for, and when it's the wrong tool for the job. Here's a quick run-down:
Use htmlspecialchars($input, ENT_QUOTES, 'utf-8') (or whatever your encoding is) if you're outputting into an HTML document and aren't interested in preserving HTML at all. HTML Purifier is unnecessary overhead there, and it'll let some things through.
Use HTML Purifier if you're outputting into an HTML document and do want to preserve user-supplied HTML.
Use htmlspecialchars($input, ENT_QUOTES, 'utf-8') if you're outputting into an HTML attribute (HTML Purifier is not meant for this use case).
You can find some more information about sanitising / escaping by context in this question / answer.
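The ENT_QUOTES flag matters for the attribute case. Here is a small sketch of the difference, assuming a template that wraps the attribute value in single quotes; the input string is made up for illustration.

```php
<?php
// Input that tries to break out of a single-quoted attribute.
$input = "x' onerror='alert(1)";

// ENT_COMPAT escapes double quotes only; the single quote survives.
$compat = htmlspecialchars($input, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES escapes both quote styles.
$quotes = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo "<img src='" . $compat . "'>", PHP_EOL; // quote still ends src early
echo "<img src='" . $quotes . "'>", PHP_EOL; // value stays inside src
```

Passing the flags explicitly also keeps the behaviour the same across PHP versions, since the default flags of htmlspecialchars changed in PHP 8.1.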