I have been reading up on htmlspecialchars() for escaping user input and user input from the database. Before anyone says anything, yes, I am filtering on db input as well as using prepared statements with bindings. I am only concerned about securing the output.
I am confused as to when to use ENT_COMPAT, ENT_QUOTES, and ENT_NOQUOTES. I came across the following excerpt while doing my research:
The second argument in the htmlspecialchars() call is ENT_COMPAT. I've used that because it's a safe default: it will also escape double-quote characters ". You only really need to do that if you're outputting inside an HTML attribute (like <img src="<?php echo htmlspecialchars($img_path, ENT_COMPAT, 'UTF-8'); ?>">). You could use ENT_NOQUOTES everywhere else.
I have found similar comments elsewhere as well. What is the purpose of converting single and/or double quotes for attributes yet not converting them elsewhere? The only thing I can think of is if you were adding actual HTML into the page, for instance:
My variable is : <img src="somepic.jpg" alt="some text">
If you converted the double quotes here, it would not render properly because of the escaped quotes. In the example given in the excerpt, though, I can't even think of an instance where any type of quote would be used.
Secondly, in this particular reference it says to use ENT_NOQUOTES everywhere else. Why? My personal thought process is telling me to use ENT_QUOTES everywhere and ENT_NOQUOTES if and only if the variable is an actual HTML attribute that requires them.
I've done lots of searching and reading, but I am still confused about all of this. My main goal is to secure output to the page so that no HTML, PHP, or JS manipulation can happen.
Just use ENT_QUOTES everywhere. PHP gives the option in case you need it, but 99% of the time you don't. Escaping the quotes unnecessarily is harmless.
htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
Because that code is too long to keep writing everywhere, wrap it in a tiny function.
function es($string) {
    return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}
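To make the difference between the three flags concrete, here is a minimal sketch that runs the same string through each of them. The sample string and variable names are made up for illustration; the helper is repeated only so the snippet runs on its own.

```php
<?php
// The helper from the answer, repeated so this sketch is self-contained.
function es($string) {
    return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}

// Hypothetical sample input with a tag, an ampersand, and both quote styles.
$input = "<b>\"double\" & 'single'</b>";

// ENT_NOQUOTES: only &, < and > are converted.
$noQuotes = htmlspecialchars($input, ENT_NOQUOTES, 'UTF-8');
// ENT_COMPAT: double quotes are converted as well, single quotes are not.
$compat = htmlspecialchars($input, ENT_COMPAT, 'UTF-8');
// ENT_QUOTES: both quote styles are converted.
$quotes = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $noQuotes, "\n"; // &lt;b&gt;"double" &amp; 'single'&lt;/b&gt;
echo $compat, "\n";   // &lt;b&gt;&quot;double&quot; &amp; 'single'&lt;/b&gt;
echo $quotes, "\n";   // &lt;b&gt;&quot;double&quot; &amp; &#039;single&#039;&lt;/b&gt;
echo es($input), "\n"; // same as the ENT_QUOTES line
```

A browser displays all three results as the same visible text, which is why the extra escaping done by ENT_QUOTES costs nothing.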
Within HTML there are different contexts in which different characters are considered special. For example, within a double-quoted attribute value, a literal double quote would be interpreted as the attribute value delimiter:
8.2.4.38 Attribute value (double-quoted) state
Consume the next input character:
↪ U+0022 QUOTATION MARK (")
Switch to the after attribute value (quoted) state.
↪ U+0026 AMPERSAND (&)
Switch to the character reference in attribute value state, with the additional allowed character being U+0022 QUOTATION MARK (").
↪ U+0000 NULL
Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's value.
↪ EOF
Parse error. Switch to the data state. Reconsume the EOF character.
↪ Anything else
Append the current input character to the current attribute's value.
In such a case the double quote needs to be encoded using a character reference. Single-quoted attribute values are similar, but there the first literal single quote is considered the end delimiter of the attribute value.
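A rough illustration of the attribute case, assuming a made-up attacker-controlled value and a title attribute chosen only for the example. It also shows why ENT_COMPAT is not enough once an attribute is delimited with single quotes.

```php
<?php
// Hypothetical attacker-controlled values; each contains the quote that delimits its attribute.
$double = 'x" onmouseover="alert(1)';
$single = "x' onmouseover='alert(1)";

// No escaping: the literal " closes the title value and a second attribute is injected.
$unsafe = '<span title="' . $double . '">hover</span>';
// <span title="x" onmouseover="alert(1)">hover</span>

// Escaped: the quote becomes &quot; and stays inside the attribute value.
$safe = '<span title="' . htmlspecialchars($double, ENT_QUOTES, 'UTF-8') . '">hover</span>';
// <span title="x&quot; onmouseover=&quot;alert(1)">hover</span>

// ENT_COMPAT leaves ' alone, so a single-quoted attribute is still broken out of.
$compatOnly = "<span title='" . htmlspecialchars($single, ENT_COMPAT, 'UTF-8') . "'>hover</span>";
// <span title='x' onmouseover='alert(1)'>hover</span>

// ENT_QUOTES converts ' to &#039; and closes that gap.
$bothQuotes = "<span title='" . htmlspecialchars($single, ENT_QUOTES, 'UTF-8') . "'>hover</span>";
// <span title='x&#039; onmouseover=&#039;alert(1)'>hover</span>

echo $unsafe, "\n", $safe, "\n", $compatOnly, "\n", $bothQuotes, "\n";
```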
The same applies to the data context, i.e., outside a tag:
8.2.4.1 Data state
Consume the next input character:
↪ U+0026 AMPERSAND (&)
Switch to the character reference in data state.
↪ "<" (U+003C)
Switch to the tag open state.
↪ U+0000 NULL
Parse error. Emit the current input character as a character token.
↪ EOF
Emit an end-of-file token.
↪ Anything else
Emit the current input character as a character token.
As you can see, the only character that would be considered harmful with regard to Cross-Site Scripting is <, as it would switch to the tag open state. So it needs to be encoded using a character reference to avoid the injection of a tag.
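For the data context, a small sketch with a made-up comment string. It uses ENT_NOQUOTES on purpose, to show that the quotes can stay literal between tags as long as < is encoded.

```php
<?php
// Hypothetical user comment that is echoed between tags, not inside an attribute.
$comment = 'He said "hi" <script>alert(1)</script>';

// ENT_NOQUOTES still converts &, < and >, so no tag can be opened.
$escaped = htmlspecialchars($comment, ENT_NOQUOTES, 'UTF-8');

echo '<p>' . $escaped . '</p>', "\n";
// <p>He said "hi" &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

The quotes stay literal and are harmless here because the parser is in the data state; the same value would not be safe if it were dropped into an attribute.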
However, character references may also be used in place of literal characters even when those characters are not special in the given context, or not special at all. For example, the following are equivalent:
<a href="http://example.com/">
<a href="&#x68;&#x74;&#x74;&#x70;&#x3A;&#x2F;&#x2F;&#x65;&#x78;&#x61;&#x6D;&#x70;&#x6C;&#x65;&#x2E;&#x63;&#x6F;&#x6D;&#x2F;">
So only certain special characters really need to be encoded as character references, depending on the context, but it does no harm to also encode characters that are special in other contexts.
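One way to convince yourself that over-encoding is harmless is to decode the result again. In this sketch html_entity_decode() and htmlspecialchars_decode() stand in for the browser's character reference decoding, which is an approximation rather than the real parser; the sample strings are made up.

```php
<?php
// Hypothetical URL with the letters of "http" written as hex character references.
$literal = 'http://example.com/';
$encoded = '&#x68;&#x74;&#x74;&#x70;://example.com/';

// Decoding the references yields the same string as the literal form.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
var_dump($decoded === $literal); // bool(true)

// Round trip: escaping everything with ENT_QUOTES loses nothing once decoded.
$text = 'Tom & "Jerry\'s" <show>';
$roundTrip = htmlspecialchars_decode(
    htmlspecialchars($text, ENT_QUOTES, 'UTF-8'),
    ENT_QUOTES
);
var_dump($roundTrip === $text); // bool(true)
```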