There are a million cheatsheets all around the tubes that enumerate, with varying degrees of completeness, the character entities specified by various versions of HTML. I don't want to trust any particular one of them, so I figure I'll toss it out here and see if anyone posts a more authoritative answer.
So, let's assume that I want to match any and all character references and entities using a regular expression. I'd start with /&(?:#(?:x[0-9a-f]+|[0-9]+)|[a-z]{???,???});/i. But what would go in place of the ??? placeholders? I can think of entities that are two characters long, like lt and gt, but are there any one-letter entities in any HTML specification? Likewise, what is the longest entity? Finally, those are the only three syntaxes for expressing literal characters in HTML aside from just typing them directly, are they not?
Cheers!
Longest in HTML5 is &CounterClockwiseContourIntegral; (which renders as ∳), and there are no one-letter names.
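If you'd rather check the bounds than take anyone's word for it, here is a minimal sketch. It assumes Python's standard-library table html.entities.html5 is a faithful copy of the HTML5 list; the variable names are just for illustration.

```python
from html.entities import html5

# Keys look like "gt;" and, for some legacy names, also "gt" (no semicolon),
# so strip the semicolon to get the bare entity names.
names = {key.rstrip(";") for key in html5}

shortest = min(len(name) for name in names)
longest = max(names, key=len)

print(shortest)               # length of the shortest name
print(longest, len(longest))  # longest name and its length
```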
But note that named entity references don't work as you think. Some legacy named character references are also recognized without the trailing semicolon (&amp and &copy, for example), so a regex that insists on the ; won't cut the mustard.
The HTML5 spec now explicitly describes what browsers have done as error correction since the mid-90s: if the text doesn't match a known character reference, show it verbatim. Therefore, if you want your regex to work like a browser, you have to copy the browsers' behaviour.
That means you have to test against a complete list of known references, like the one mentioned by Jukka. You can shorten the alternation with character classes and grouping, e.g. [aeiou]uml, but you need to bake into the regex the same knowledge the browser has in order to get the same result.
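A rough sketch of that "bake the list into the regex" idea, assuming Python and its html.entities.html5 table as the list of known references (CHAR_REF and find_references are made-up names). It is a simplification: numeric references are required to end in a semicolon, and the special handling of references inside attribute values is ignored.

```python
import re
from html.entities import html5

# html5 keys include the semicolon ("notin;"); legacy names that browsers
# accept without it appear twice ("amp;" and "amp").
# Longest first, so "notin;" is tried before the legacy prefix "not".
known = sorted(html5, key=len, reverse=True)
named = "|".join(re.escape(name) for name in known)

CHAR_REF = re.compile(r"&(?:#[xX][0-9a-fA-F]+;|#[0-9]+;|(?:" + named + r"))")

def find_references(text):
    """Return every substring that matches a known character reference."""
    return [m.group(0) for m in CHAR_REF.finditer(text)]

print(find_references("a &lt; b &amp c &bogus; &#x26; &#38; &notin;"))
```

Unknown names such as &bogus; are simply not matched, which mirrors the "show it verbatim" behaviour.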
Edit: By the way, named entities might also have digits in them, e.g., &emsp13;.
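To see which names contain digits, something like this should do, again assuming Python's html.entities.html5 mirrors the HTML5 list (with_digits is an illustrative name):

```python
from html.entities import html5

# Strip the trailing semicolon from the keys to get bare names.
names = {key.rstrip(";") for key in html5}

# Collect every name that contains at least one digit.
with_digits = sorted(name for name in names if any(ch.isdigit() for ch in name))
print(with_digits)
```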
Entity names used to be short, 2 to 8 characters (thetasym being the longest), following SGML tradition, and this is still the case in the HTML 4.01 specification (and XHTML specifications). But HTML5 drafts add a large number of entities, called named character references there, and some of them are fairly long, like EmptyVerySmallSquare. So it would be better to avoid any fixed upper limit, or a lower limit larger than 1.
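If all you need is a purely syntactic match with no length limits, a pattern along these lines might be enough. It is only a sketch: LOOSE_REF is a made-up name, and the pattern assumes names are ASCII letters and digits starting with a letter, which can be checked against Python's html.entities.html5 table. Unlike a browser, it also matches names that are not defined anywhere.

```python
import re
from html.entities import html5

# Hex reference, decimal reference, or a name of any length >= 1.
LOOSE_REF = re.compile(r"&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")

# Sanity check: every known HTML5 name fits the name part of the pattern.
names = {key.rstrip(";") for key in html5}
all_fit = all(LOOSE_REF.fullmatch("&" + name + ";") for name in names)
print(all_fit)
```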