Is there a security risk in leaving ampersands une

Posted 2019-02-23 23:56

Question:

Is there any security risk in escaping other special characters but leaving ampersands untouched when displaying user-generated/submitted information? I'd like to let my users enter HTML entities and hex and decimal character references freely without adding unnecessary complexity to my sanitizer.

Answer 1:

TL;DR: Leaving in ampersands (or other "special characters") is not a security issue if the code that outputs them is correct. That is, what matters is how the data is output and used, not what was input.

It all depends on how the data is ultimately used. For instance, <input value="<? echo $input ?>" /> is not correctly coded for arbitrary input: a " in $input ends the attribute value early.
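
As a rough illustration of escaping at output time rather than at input time, here is a minimal Python sketch (Python is used only for illustration; the snippet above is PHP). The function name render_input and the hostile value are hypothetical. It relies on the standard library's html.escape, which with quote=True also escapes " and '.

```python
import html


def render_input(value: str) -> str:
    """Build an <input> tag with the value escaped for an HTML attribute."""
    # quote=True (the default) escapes " and ' in addition to &, < and >,
    # so the value cannot terminate the attribute early.
    return '<input value="{}" />'.format(html.escape(value, quote=True))


# Hypothetical hostile input that tries to close the attribute.
hostile = '" onfocus="alert(1)'
print(render_input(hostile))
# The stored input is untouched; only the rendered output is escaped.
```

The input itself is stored as entered; the escaping belongs to the place where the value is written into HTML.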

Now, an & is often much less of a "problem" than some other characters (say ', ", < or >), but it can cause artifacts (including errors and undefined behavior) in some situations, or be used to add an extra query parameter to a URL:

  • .. but if the URL is not encoded as appropriate when output, then it's not correctly coded 1
  • .. and of course if an & is written verbatim into an XML/HTML stream, then it's not correctly coded 2
  • .. and if the program is passing a raw & [from user input] to a "shell string-execute" then it's [very likely] not correctly coded 3
  • .. it all comes down to use.

I tend not to alter the input, except to make it conform to business rules, and that does not include the above-mentioned case! (But it may be a perfectly valid business rule to not accept an ampersand at all.)

Proper escaping (or, better yet, approaches that don't require [manual] escaping) at the appropriate times takes care of the rest and ensures that, through good coding of the usage, trivial attacks or accidental blunders are mitigated.

In fact, I would argue that this sort of "input sanitization" shows a lack of trust in the approaches/code used elsewhere and can lead to more problems when the "sanitization" later has to be undone. Magic quotes, anyone?


1 This is a case where an & in the user input can actually cause a form of injection. Imagine format("http://site/?view={0}", user_input), where user_input contains 1&buy=1. The result is "http://site/?view=1&buy=1", which now carries a second parameter, buy. The correct method is to URI-encode (aka percent-encode) the value, which yields "http://site/?view=1%26buy%3D1". (Note that there is only one query parameter in the correctly coded case. If the intent is to allow "raw" input to be passed through, then carefully define and analyze the permissible rules, and see the following paragraph.)
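
The difference can be checked with a small Python sketch using the standard urllib.parse module. The site URL is the hypothetical one from the footnote, written with the value in the query string so that parse_qs can inspect it; the variable names are illustrative only.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

user_input = "1&buy=1"

# Naive formatting: the raw & splits the value into two parameters.
naive = "http://site/?view={}".format(user_input)

# Percent-encode the value; safe="" makes sure & and = are encoded too.
encoded = "http://site/?view={}".format(quote(user_input, safe=""))

# Alternatively, let urlencode build the whole query string.
built = "http://site/?" + urlencode({"view": user_input})

print(parse_qs(urlsplit(naive).query))    # two parameters: view and buy
print(parse_qs(urlsplit(encoded).query))  # one parameter: view
print(encoded == built)
```

Building the query with urlencode (or an equivalent helper) is usually less error-prone than formatting the string by hand, because every value is encoded by default.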

2 While a "bare" & can be valid in an HTML stream, user input should not be relied upon to be valid HTML. That is, regardless of whether the target is XML or HTML, the correct output/rendering escaping mechanism should be used. (The escaping mechanism might choose not to encode "bare" &'s, but that is a secondary concern. The lazy programmer will keep using the same escaping technique for all applicable output to get consistent, reliable, and safe output.)

3 Instead of using a shell-execute that takes a single string of arguments that must be parsed, use an exec form that takes a list of arguments. The latter [generally] avoids spawning a shell and the associated shell hacks. And, of course, never let the user specify the executable.
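
A minimal Python sketch of the list-of-arguments form, assuming the standard subprocess module. The child process here is just the current Python interpreter echoing its first argument back, chosen so the example runs anywhere; the input string is hypothetical.

```python
import subprocess
import sys

# Hypothetical user input containing a shell metacharacter.
user_input = "hello & echo injected"

# List form: each element becomes exactly one argv entry of the child.
# No shell is started, so the & is never interpreted as a command separator.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
    capture_output=True,
    text=True,
    check=True,
)

# The child received the input as a single, unmodified argument.
print(result.stdout.strip() == user_input)
```

Passing the same text inside a single command string with shell=True would hand the & to the shell for interpretation, which is the situation the footnote warns about.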



Answer 2:

It all depends on the context the data is put into.

In HTML, the main reason to represent a plain & by a character reference is to avoid ambiguity, as the & is also the start of such a character reference. A popular example of such ambiguity is a plain & as part of a URL parameter in an HTML attribute like this:

<a href="/?lang=en&sect=foobar">

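Here the & is not encoded with the corresponding character reference (&amp;), so the parser may treat it as the start of a character reference. Since sect is a known entity name in HTML, representing the section sign §, such a parser interprets the attribute value as /?lang=en§=foobar. (Parsers that follow the HTML5 algorithm leave a semicolon-less named reference alone inside an attribute value when it is followed by = or an alphanumeric character, but older or non-browser consumers may still decode it.)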
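
The ambiguity can be reproduced with Python's standard html module, as a sketch only: html.unescape decodes semicolon-less legacy names such as sect using text-content rules, so it is a stand-in for a lenient consumer, not a model of how any particular browser parses attributes.

```python
import html

href = "/?lang=en&sect=foobar"

# A lenient decoder reads &sect as the section sign and keeps "=foobar".
decoded = html.unescape(href)
print(decoded)

# Escaping the & on output removes the ambiguity: decoding the escaped
# form gives back the original URL.
escaped = html.escape(href)
print(escaped)
print(html.unescape(escaped) == href)
```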

So leaving a plain & as it is does not pose an actual threat the way other special characters in HTML do, since those can change the context the data is put into:

  • the tag delimiters < and > can start or end a tag declaration,
  • the attribute value delimiters " and ' can start or end an attribute value declaration.

To be on the safe side, you should use htmlspecialchars with the double_encode parameter set to false to avoid a double encoding of already existing character references:

var_dump(htmlspecialchars('<"&amp;\'>', ENT_QUOTES, 'UTF-8', false) === '&lt;&quot;&amp;&#039;&gt;'); // bool(true)
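
For readers outside PHP, the same idea can be approximated in Python. This is a sketch under stated assumptions, not a port of htmlspecialchars: the function name escape_keep_references is hypothetical, and the pattern only checks that a reference is well-formed (a name, decimal or hex number followed by ;), not that the name is a known entity.

```python
import re

# An & that does NOT start a well-formed character reference
# (named, decimal or hexadecimal, terminated by a semicolon).
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_keep_references(text: str) -> str:
    """Escape HTML special characters but leave existing references intact."""
    # Escape bare ampersands first so the & in the replacements
    # added below is not escaped a second time.
    text = _BARE_AMP.sub("&amp;", text)
    return (
        text.replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&#039;")
    )


print(escape_keep_references("<\"&amp;'>"))
print(escape_keep_references("Tom & Jerry &#169; &#xA9; &copy;"))
```

Users can then type entities and numeric references freely, while a stray & and the context-changing characters are still escaped on output.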