When I copy and paste the URL of this Wikipedia article, it looks like this:
http://en.wikipedia.org/wiki/Gruy%C3%A8re_%28cheese%29
However, if you paste this back into the browser's address bar, the percent signs disappear, and what appear to be Unicode characters (and maybe special URL characters) take their place.
Are these abbreviations for Unicode and special URL characters?
I'm used to seeing \u00ff, etc. in JavaScript.
It is important to note that the % sign serves two primary purposes. One is to encode special characters, and the other is to encode Unicode characters beyond what you can type with your hardware/keyboard. For example, %C3%A8 encodes è, and %2F encodes a forward slash /.
Using JavaScript we can create an encoding chart:
http://jsfiddle.net/CG8gx/3/
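A comparable chart can also be sketched in Python. This is not the code behind the linked fiddle, only an illustration that assumes Python 3 and the standard library's urllib.parse.quote; the helper name percent_chart and the sample characters are made up for the example.

```python
from urllib.parse import quote

def percent_chart(chars):
    """Map each character to its percent-encoded form.

    safe="" tells quote() not to leave "/" unescaped; letters, digits
    and "_.-~" are still never escaped by quote().
    """
    return {ch: quote(ch, safe="") for ch in chars}

# Sample characters: a space, URL metacharacters, and one non-ASCII letter.
chart = percent_chart(" /[]()\u00e8")
for ch, encoded in chart.items():
    print(repr(ch), "->", encoded)
```

Non-ASCII characters such as è come out as more than one %XX group, because quote() percent-encodes each byte of the character's UTF-8 encoding.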
It's just a different syntactical convention from what you're used to in JavaScript. URL syntax is simply different from that of JavaScript, in other words, and % is the way one introduces a two-hex-digit character code in that syntax.
Some characters must be escaped in order to be part of a URL/URI. For example, the / character has meaning; it's a metacharacter, in other words. If you need a / in the middle of a path component (which admittedly would be a little weird), you'd have to escape it. It's analogous to the need to escape quote characters in JavaScript string constants.
A % in a URI is followed by two hex digits (0-9, A-F), and the sequence is the escaped version of the character with that hex code. Doing this means you can write a URI with characters that might have special meaning in other languages.
Common examples are %20 for a space, and %5B and %5D for [ and ], respectively.
The reference you're looking for is RFC 3987: Internationalized Resource Identifiers, specifically the section on mapping IRIs to URIs.
RFC 3986: Uniform Resource Identifiers specifies that reserved characters must be percent-encoded, but it also specifies that percent-encoded characters are decoded to US-ASCII, which does not include characters such as è.
RFC 3987 specifies that non-ASCII characters should first be encoded as UTF-8 so they can be percent-encoded as per RFC 3986. If you'll permit me to illustrate in Python:
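A minimal sketch of that encoding step, assuming Python 3 (where str is Unicode text and bytes is raw data); the variable names are illustrative only:

```python
# è written as an escape, so the source file's own encoding doesn't matter.
char = "\u00e8"

# Step 1: encode the Unicode character to bytes using UTF-8.
utf8_bytes = char.encode("utf-8")

# Step 2: percent-encode each byte as "%" plus two uppercase hex digits.
percent_encoded = "".join("%{:02X}".format(b) for b in utf8_bytes)

print(utf8_bytes)       # b'\xc3\xa8'
print(percent_encoded)  # %C3%A8
```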
Here I've asked Python to encode the Unicode è to a string of bytes using UTF-8. The bytes returned are 0xc3 and 0xa8. Percent-encoded, this looks like %C3%A8.
The parentheses also appearing in your URL do fit in US-ASCII, so they are percent-escaped with their US-ASCII code points, which are also valid UTF-8.
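To see both cases together, the path from the URL in the question can be round-tripped. This is only a sketch, assuming Python 3's urllib.parse module, whose unquote() decodes percent sequences as UTF-8 by default; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

# Path portion of the Wikipedia URL from the question.
url_path = "/wiki/Gruy%C3%A8re_%28cheese%29"

# Decoding: %C3%A8 becomes è (two UTF-8 bytes), %28 and %29 become ( and ).
decoded = unquote(url_path)
print(decoded)    # /wiki/Gruyère_(cheese)

# Re-encoding: quote() keeps "/" as a path separator by default (safe="/").
reencoded = quote(decoded)
print(reencoded)  # /wiki/Gruy%C3%A8re_%28cheese%29
```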
So, no, there is no simple 16×16 table; such a table could never represent the richness of Unicode. But there is a method to the apparent madness.