Cleaning all inline events from HTML tags

Posted 2019-03-30 18:36

Question:

For HTML input, I want to neutralize all HTML elements that have inline JavaScript (onclick="..", onmouseout="..", etc.). Wouldn't it be enough to encode the following characters: =, (, and )?

So onclick="location.href='ggg.com'"
will become onclick%3D"location.href%3D'ggg.com'"

What am I missing here?

Edit: I do need to accept active HTML (I can't just escape it all or convert it to entities).

Answer 1:

There's no simple method to accept HTML, but not scripts.

You have to parse HTML to DOM, remove all unwanted elements and attributes in DOM and generate new HTML.

It can't be done reliably with regular expressions.

Removing on* attributes is not enough. Scripts can also be embedded in style, src, href and other attributes.

If you're using PHP, then use HTML Purifier.
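
For readers not on PHP, here is a minimal sketch of the parse, strip, regenerate approach using only Python's standard library. The allow-lists, the `is_safe_url` helper and the `sanitize` function are hypothetical names chosen for illustration, and the policy is deliberately tiny; treat it as a demonstration of the idea, not a vetted sanitizer. A maintained library remains the safer choice.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-lists for illustration; a real policy needs careful review.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
SAFE_SCHEMES = {"http", "https", "mailto"}
VOID_TAGS = {"br"}


def is_safe_url(value):
    """Accept relative URLs and URLs whose scheme is allow-listed."""
    # Browsers ignore tabs and newlines inside URLs, so drop them before checking.
    cleaned = "".join(ch for ch in value if ch > " ").lower()
    head = re.split(r"[/?#]", cleaned, maxsplit=1)[0]
    if ":" not in head:
        return True  # no scheme, so this is a relative URL
    return head.split(":", 1)[0] in SAFE_SCHEMES


class AllowListSanitizer(HTMLParser):
    """Rebuilds HTML from parser events, keeping only allow-listed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops on*, style, and anything else not listed
            value = value or ""
            if name == "href" and not is_safe_url(value):
                continue  # drops javascript:, data:, and similar URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The output is built only from tags and attributes the policy names explicitly, so an unknown attribute or tag is dropped by default instead of having to be recognized as dangerous.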



Answer 2:

You probably have a couple of options. The easiest way is to convert quotes, and possibly the < and > characters, to their HTML-encoded equivalents (&quot;, etc.), which will result in the HTML code being displayed literally.

Tell me what server-side language you are using and I can point you towards more language-specific information, if you like. (For example, PHP has htmlspecialchars()[1]).
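
As an illustration of the escaping option in another language, Python's standard library has `html.escape`, which plays roughly the same role as htmlspecialchars(). The input string below is just the question's example wrapped in a made-up link. Note that this displays the markup as text, so it only fits when no active HTML has to survive.

```python
from html import escape

# The question's example attribute inside a made-up <a> element.
user_input = '<a onclick="location.href=\'ggg.com\'">click</a>'

# quote=True also converts " and ' so the result is safe inside attribute values.
escaped = escape(user_input, quote=True)
print(escaped)
# &lt;a onclick=&quot;location.href=&#x27;ggg.com&#x27;&quot;&gt;click&lt;/a&gt;
```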

EDIT: I just actually read your question. Okay, you want to allow HTML through but no JavaScript? Well, for lack of a simple solution jumping to my mind, I suggest just using string replacement (regular expressions if you can, maybe?) to get rid of them entirely.

There is a finite set of event handler attributes in HTML. Couple that with the need for quotation marks and you're probably good.

For proof of concept, in Perl, you'd probably do something like this:

# Only a few handler names are listed; the rest ([...]) go in the same alternation.
$myInput =~ s/\bon(?:mouseover|mouseout|click|focus|blur)\s*=\s*(?:"[^"]*"|'[^']*')\s*//gi;

So, match the event handler name (only some of which I included), the equals sign, then a quoted expression using either single or double quotes, allow optional whitespace on the end, and replace the entire thing with nothing (i.e., delete it).

That won't work for something requiring more levels of quotation, though, since eventually you would come back to the original delimiters. Forgive the contrived and completely useless example:

onclick="eval('3+prompt("Enter a number: ")')"
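
To make the failure concrete, here is a small Python sketch of the same regex idea (handler name, an equals sign, then a quoted value). The `strip_handlers` name and the surrounding `<a>` markup are assumptions made for the demo. It handles the simple case, but on the nested example it stops at the first matching quote and leaves debris behind. It also never touches an unquoted value, which HTML permits.

```python
import re

# Handler name, "=", then a single- or double-quoted value (only a few names listed).
HANDLER_RE = re.compile(
    r"""\bon(?:mouseover|mouseout|click|focus|blur)\s*=\s*(?:"[^"]*"|'[^']*')\s*""",
    re.IGNORECASE,
)


def strip_handlers(html_text):
    """Delete every match of HANDLER_RE from the input."""
    return HANDLER_RE.sub("", html_text)


simple = strip_handlers('<a href="x" onclick="go()">link</a>')
nested = strip_handlers('<a onclick="eval(\'3+prompt("Enter a number: ")\')">x</a>')
unquoted = strip_handlers("<a onclick=alert(1)>x</a>")

print(simple)    # <a href="x" >link</a>
print(nested)    # <a Enter a number: ")')">x</a>
print(unquoted)  # <a onclick=alert(1)>x</a>
```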

In THAT case, you might want to write a loop that parses the string first by word (i.e., looking for the event handler name), then going character by character, keeping track of the number of quoting levels as you go and keeping track of the current delimiter:

  1. Mark the index of the beginning of the handler name (the "o" in onclick, etc.)
  2. Start with quoting level 0 (or 1 after you've processed the opening quotation delimiter).
  3. If the current delimiter is " and you see ', then increase the quoting level by 1 and switch current delimiter to '.
  4. If the current delimiter is " and you see ", decrease the quoting level by 1 and switch current delimiter to '.
  5. If the current delimiter is ' and you see ", then increase the quoting level by 1 and switch current delimiter to ".
  6. If the current delimiter is ' and you see ', decrease the quoting level by 1 and switch current delimiter to ".
  7. If the quoting level gets back down to 0, then your string has ended. Mark the index of where the string ends.
  8. Use a string manipulation function to cut out the substring from the first index to the last index.
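
The steps above can be sketched in Python roughly as follows. The function names `find_handler_end` and `cut_handler` are hypothetical, and the code assumes that after a decrease the delimiter flips to the other quote character (in every case), and that only the first occurrence of one handler name is handled. It follows the steps literally and does not model how an HTML parser actually reads attribute values, so a stray apostrophe inside the script would throw the count off.

```python
def find_handler_end(text, start):
    """Return the index just past the handler that begins at `start` (step 1).

    Returns None if the quoting level never gets back to 0.
    """
    level = 0         # step 2: quoting level
    delimiter = None  # current quote character, set by the opening quote
    for i in range(start, len(text)):
        ch = text[i]
        if ch not in "\"'":
            continue
        if level == 0:
            level, delimiter = 1, ch  # opening delimiter of the attribute value
        elif ch == delimiter:
            # steps 4 and 6: same quote closes a level, delimiter flips back
            level -= 1
            delimiter = "'" if ch == '"' else '"'
            if level == 0:
                return i + 1  # step 7: the string has ended
        else:
            # steps 3 and 5: the other quote opens a deeper level
            level += 1
            delimiter = ch
    return None


def cut_handler(text, name="onclick"):
    """Remove the first `name="..."` handler found in text (step 8)."""
    start = text.lower().find(name)
    if start == -1:
        return text
    end = find_handler_end(text, start)
    if end is None:
        return text  # unbalanced quoting; such input could simply be rejected
    return text[:start] + text[end:]


print(cut_handler('<a onclick="eval(\'3+prompt("Enter a number: ")\')">x</a>'))
# <a >x</a>
```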

It's a little more time-consuming, but it should theoretically work no matter what, assuming the HTML is well-formed. (That's a horrible assumption, but if it's not well-formed you could just reject the input anyway!)

[1] http://us3.php.net/manual/en/function.htmlspecialchars.php