Why is so much HTML input sanitization necessary?

Posted 2020-07-27 04:12

Question:

I have implemented a search engine in C for my html website. My entire web is programmed in C.

I understand that HTML input sanitization is necessary because an attacker can enter these two HTML snippets into my search page to trick it into downloading and displaying foreign images/scripts (XSS):

<img src="path-to-attack-site"/>
<script>...xss-code-here...</script>

Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query? Wouldn't that render both snippets useless, since they would no longer be treated as HTML? I've seen HTML filtering that goes way beyond this, filtering out absolutely all JavaScript commands and HTML markup!

Answer 1:

Input sanitisation is not inherently ‘necessary’.

It is a good idea to remove things like control characters that you never want in your input, and certainly for specific fields you'll want specific type-checking (so that, e.g., a phone number contains digits).

But running escaping/stripping functions across all form input for the purpose of defeating cross-site-scripting attacks is absolutely the wrong thing to do. It is sadly common, but it is neither necessary nor in many cases sufficient to protect against XSS.

HTML-escaping is an output issue which must be tackled at the output stage: that is, usually at the point you are templating strings into the output HTML page. Escape < to &lt;, & to &amp;, and in attribute values escape the quote you're using as an attribute delimiter, and that's it. No HTML-injection is possible.
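Since the site in question is written in C, the output-stage escaping described above might look something like the sketch below. `html_escape` is a hypothetical helper, not a library function; it also escapes > and both kinds of quote, which is slightly more than the minimum but harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: HTML-escape `in` into `out` (capacity `out_size`,
 * including the terminating NUL). Returns 0 on success, or -1 if the
 * buffer is too small, in which case `out` is set to the empty string. */
int html_escape(const char *in, char *out, size_t out_size)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    for (; *in != '\0'; in++) {
        char single[2] = { *in, '\0' };
        const char *rep;

        switch (*in) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&#39;";  break;
        default:   rep = single;   break;  /* ordinary character, copied as-is */
        }

        size_t len = strlen(rep);
        if (len >= out_size - used) {      /* no room for rep plus the NUL */
            out[0] = '\0';
            return -1;
        }
        memcpy(out + used, rep, len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

You would call this at the point where the search term is written into the results page, for example on the raw query string just before printing it inside the page markup, and leave the stored or searched value untouched.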

If you try to HTML-escape or filter at the form input stage, you're going to have difficulty whenever you output data that has come from a different source, and you're going to be mangling user input that happens to include <, & and " characters.

And there are other forms of escaping. If you try to create an SQL query with the user value in, you need to do SQL string literal escaping at that point, which is completely different to HTML escaping. If you want to put a submitted value in a JavaScript string literal you would have to do JSON-style escaping, which is again completely different. If you wanted to put a value in a URL query string parameter you need URL-escaping, not HTML-escaping. The only sensible way to cope with this is to keep your strings as plain text and escape them only at the point you output them into a different context like HTML.
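To show how different another context's escaping is, here is a minimal sketch of URL-escaping (percent-encoding) a query string parameter in C. `url_escape` is a hypothetical helper; it leaves only the RFC 3986 unreserved characters alone and encodes every other byte as %XX.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: percent-encode `in` into `out` (capacity `out_size`,
 * including the terminating NUL) for use as a URL query parameter value.
 * Returns 0 on success, or -1 if the buffer is too small, in which case
 * `out` is set to the empty string. */
int url_escape(const char *in, char *out, size_t out_size)
{
    /* RFC 3986 unreserved characters, which may appear unencoded. */
    static const char unreserved[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789-_.~";
    static const char hex[] = "0123456789ABCDEF";
    size_t used = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        int plain = strchr(unreserved, c) != NULL;  /* c is never 0 here */
        size_t len = plain ? 1 : 3;

        if (len >= out_size - used) {   /* no room for this byte plus the NUL */
            out[0] = '\0';
            return -1;
        }
        if (plain) {
            out[used++] = (char)c;
        } else {
            out[used++] = '%';
            out[used++] = hex[c >> 4];
            out[used++] = hex[c & 0x0F];
        }
    }
    out[used] = '\0';
    return 0;
}
```

The same plain-text value `a<b&c` becomes `a%3Cb%26c` here but `a&lt;b&amp;c` when HTML-escaped, which is why the value has to stay as plain text until you know which context it is going into.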

> Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query?

Well yes, if you also stripped ampersands and quotes. But then users wouldn't be able to use those characters in their content. Imagine us trying to have this conversation on SO without being able to use <, & or "! And if you wanted to strip out every character that might be special when used in some context (HTML, JavaScript, CSS...) you'd have to disallow almost all punctuation!

< is a valid character, which the user should be permitted to type, and which should come out on the page as a literal less-than sign.

> My entire web is programmed in C.

I'm so sorry.



Answer 2:

Encoding angle brackets is indeed sufficient in most cases to prevent XSS, as anything between tags will then display as plain text.