After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encodings are correct - the Wikipedia article on percent encoding alludes to using +
instead of %20
for spaces, while still having an application/x-www-urlencoded
content type.
This leads me to think the +
vs. %20
behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.
Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX
string. Correct me if I am wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of different parts of a URL.
This leads me to think the +
vs. %20
behavior depends on which part of the URL is being encoded.
Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.
The use of '+'
for encoding space characters is specific to the application/x-www-form-urlencoded
format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.
The application/x-www-form-urlencoded
format is formally defined by W3C in the HTML specifications. Here is the definition from HTML 4.01:
Section 17.13.3 Processing form data, Step four: Submit the encoded form data set
This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:
• If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.
• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.
Section 17.13.4 Form content types, application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
1.Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
2.The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are way more refined and detailed, but for purposes of this discussion, the jist is roughly the same.
So, in the situation where the webform data is submitted via an HTTP GET
request instead of a POST
request, the webform data is encoded using application/x-www-form-urlencoded
and placed as-is in the URL query
component.
Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.
'+'
is a reserved character:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The query
component explicitly allows unencoded '+'
characters, as it allows characters from sub-delims
:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
So, in the context of a webform submission, spaces are encoded using '+'
prior to then being put as-is into the query
component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded
is compatible with the definition of the query
component.
So, for example: http://server/script?field=hello+world
However, outside of a webform submission, putting a space character directly into the query
component requires the use of pct-encoded
, since ' '
is not included in either unreserved
or sub-delims
, and is not explicitly allowed by the query
definition.
So, for example: http://server/script?hello%20world
Similar rules also apply to the path
component, due to its use of pchar
:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
So, although path
does allow for unencoded sub-delims
characters, a '+'
character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded
is not used with the path
component, so a space character has to be encoded as %20
due to the definitions of pchar
and segment-nz-nc
.
Now, regarding the charset used to encode characters -
For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset
attribute or hidden _charset_
field directly in the <form>
itself, otherwise the charset is typically the charset used by the parent HTML.
However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ascii characters in a URL component (the IRI syntax, on the other hand, requires UTF-8 especially when converting an IRI into an URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not), otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding
), but those are not commonly used.