I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.
http://validator.w3.org is giving me this:
& did not start a character reference. (& probably should have been escaped as &amp;.)
Do I really need to do &amp;?
I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.
Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.
Encoding & as &amp; under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.
Compare the following: which is easier? Which is easier to bugger up?
Methodology 1
Methodology 2
(with a grain of salt, please ;) )
volt & amp
> In that case don't bother encoding it.
amp&volt
> In that case don't bother encoding it.
volt&amp
> Encode it.
??
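As a rough illustration of the "encode them all" rule, here is a minimal JavaScript sketch. The helper name `encodeAmpersands` and the sample strings are hypothetical (the third sample stands for an ampersand directly followed by text that looks like a reference name), and the sketch assumes the input is raw text that has not been encoded before.

```javascript
// Hypothetical helper: encode every ampersand without inspecting what follows it.
// Assumes raw, not-yet-encoded text; running it twice would double-encode.
function encodeAmpersands(text) {
  return text.replace(/&/g, '&amp;');
}

// Illustrative samples: isolated, embedded, and reference-like ampersands.
for (const sample of ['volt & amp', 'amp&volt', 'volt&amp']) {
  console.log(encodeAmpersands(sample));
}
```

All three cases go through the same code path, so there is no case-by-case decision to get wrong.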
If & is used in HTML then you should escape it.
If & is used in JavaScript strings, e.g. in alert('This & that'); or location.href, you don't need to escape it.
If you're using document.write then you should escape it, e.g.
document.write('<p>this &amp; that</p>')
Yes. Just as the error said, in HTML, attributes are #PCDATA meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and, if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as &amp; and everything will be fine.
HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than to worry about which ones should be escaped and which ones don't need to be.
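To make the "looks like a character reference" condition more concrete, here is a simplified JavaScript sketch. It is a heuristic only: the function name `classifyAmpersands` and the pattern are assumptions for illustration, it does not consult the real table of named references, and it ignores legacy references written without a semicolon.

```javascript
// Heuristic pattern (an assumption, not the spec's algorithm): an ampersand
// followed by a decimal, hexadecimal, or name-like token and a semicolon.
const REFERENCE_LIKE = /^&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);/;

// Returns one entry per ampersand, noting whether the text that follows
// looks like a character reference under the heuristic above.
function classifyAmpersands(markup) {
  const results = [];
  for (let i = 0; i < markup.length; i++) {
    if (markup[i] !== '&') continue;
    results.push({ index: i, looksLikeReference: REFERENCE_LIKE.test(markup.slice(i)) });
  }
  return results;
}

console.log(classifyAmpersands('Fish & chips &amp; tea &#38; more'));
```

Having to reason about each ampersand this way is exactly the bookkeeping that escaping every instance avoids.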
Keep this point in mind: if you're not escaping & to &amp;, that's bad enough for data that you create (where the code could very well be invalid), but you might also not be escaping tag delimiters. That is a huge problem for user-submitted data, and could very well lead to HTML and script injection, cookie stealing and other exploits.
Please just escape your code. It will save you a lot of trouble in the future.
I’ve researched this thoroughly and wrote about my findings here: http://mathiasbynens.be/notes/ambiguous-ampersands
I’ve also created an online tool that you can use to check your markup for ambiguous ampersands or character references that don’t end with a semicolon, both of which are invalid. (No HTML validator currently does this correctly.)
I was checking why image URLs need escaping, so I tried it in https://validator.w3.org. The explanation is pretty nice. It highlights that even URLs need to be escaped. [PS: I guess it will be unescaped when it's consumed, since URLs need &. Can anyone clarify?]
If you're really talking about the static text
stored in some file on the hard disk and served directly by a server, then yes: it probably doesn't need to be escaped.
However, since there is very little HTML content nowadays that's completely static, I'll add the following disclaimer that assumes that the HTML content is generated from some other source (database content, user input, web service call result, legacy API result, ...):
If you don't escape a simple &, then chances are you also don't escape a &amp;
or a &nbsp; or <b>
or <script src="http://attacker.com/evil.js"> or any other invalid text. That would mean that you are at best displaying your content wrongly and more likely are susceptible to XSS attacks.
In other words: when you're already checking and escaping the other more problematic cases, then there's almost no reason to leave the not-totally-broken-but-still-somewhat-fishy standalone & unescaped.
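For generated content, one way to handle the ampersand together with the more dangerous characters is a single escaping helper. The following JavaScript is a minimal sketch under assumptions: `escapeHtml` and `HTML_ESCAPES` are hypothetical names, the table covers only the five characters commonly escaped for HTML text and quoted attribute values, and real projects would usually rely on their templating system's built-in escaping instead.

```javascript
// Hypothetical replacement table for HTML text and quoted attribute values.
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

// Escapes all five characters in one pass, so an ampersand produced by an
// earlier replacement is never escaped a second time.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Illustrative untrusted input containing both an ampersand and a script tag.
const userInput = 'Tom & Jerry <script src="http://attacker.com/evil.js"></script>';
console.log('<p>' + escapeHtml(userInput) + '</p>');
```

Once such a helper is applied to every piece of dynamic content, the standalone & is handled for free along with the tag delimiters.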