After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encoding are correct. The Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-form-urlencoded content type.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.
Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX string. Correct me if I am wrong here (for instance Latin-1 instead of UTF-8), but I am more interested in the differences between the encodings of different parts of a URL.
Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.
The use of '+' for encoding space characters is specific to the application/x-www-form-urlencoded format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.
The application/x-www-form-urlencoded format is formally defined by W3C in the HTML specifications. Here is the definition from HTML 4.01:
Section 17.13.3 Processing form data, Step four: Submit the encoded form data set
Section 17.13.4 Form content types, application/x-www-form-urlencoded
The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are far more refined and detailed, but for purposes of this discussion, the gist is roughly the same.
So, in the situation where the webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.
Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
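'+' is a reserved character:
    reserved    = gen-delims / sub-delims
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The query component explicitly allows unencoded '+' characters, as it allows characters from sub-delims:
    query       = *( pchar / "/" / "?" )
    pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
So, in the context of a webform submission, spaces are encoded using '+' prior to then being put as-is into the query component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.
So, for example: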
http://server/script?field=hello+world
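As a rough illustration of the webform case, the sketch below uses Python's standard urllib.parse module, whose urlencode function produces application/x-www-form-urlencoded output. The field name and server URL are simply the placeholder values from the example; nothing here is specific to any real server.

```python
from urllib.parse import urlencode, parse_qs

# Form data as a browser would collect it from a webform.
form_data = {"field": "hello world"}

# urlencode() applies form-style encoding, so the space becomes '+'.
query = urlencode(form_data)

# The encoded form data is placed as-is into the query component.
url = "http://server/script?" + query

# A form-aware parser on the receiving side turns '+' back into a space.
decoded = parse_qs(query)

print(url)
print(decoded)
```

The '+' only means "space" because both sides agree the query holds form-encoded data; the URL syntax itself just sees a literal '+'.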
However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, since ' ' is not included in either unreserved or sub-delims, and is not explicitly allowed by the query definition.
So, for example:
http://server/script?hello%20world
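To contrast this with form encoding, here is a minimal sketch using Python's urllib.parse, where quote and unquote follow generic percent-encoding and unquote_plus follows the form-encoding convention. The input strings are illustrative only.

```python
from urllib.parse import quote, unquote, unquote_plus

# Generic percent-encoding: a space becomes %20, never '+'.
raw_query = quote("hello world")
url = "http://server/script?" + raw_query

# A generic URL decoder leaves '+' untouched, because '+' is just
# a literal sub-delims character outside of form encoding.
generic = unquote("hello+world")

# Only a form-aware decoder treats '+' as an encoded space.
form_style = unquote_plus("hello+world")

# A literal '+' that must survive form decoding has to be sent as %2B.
literal_plus = quote("a+b")

print(url, generic, form_style, literal_plus)
```

Which decoder is appropriate depends on how the component was populated, which is why the same '+' can legitimately decode two different ways.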
Similar rules also apply to the path component, due to its use of pchar:
    segment       = *pchar
    segment-nz    = 1*pchar
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
So, although path does allow for unencoded sub-delims characters, a '+' character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment-nz-nc.
Now, regarding the charset used to encode characters:
For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset attribute or a hidden _charset_ field directly in the <form> itself; otherwise, the charset is typically the charset used by the parent HTML.
However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ASCII characters in a URL component (the IRI syntax, on the other hand, requires UTF-8, especially when converting an IRI into a URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not); otherwise, the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin-1, Shift-JIS, etc.). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding), but those are not commonly used.
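To make the charset point concrete, the following sketch (again with Python's urllib.parse, and a sample character chosen purely for illustration) shows how the same character yields different %XX octets depending on the charset, and what happens when the decoder assumes the wrong one.

```python
from urllib.parse import quote, unquote

text = "é"

# UTF-8 represents this character as two octets, so two %XX triplets.
utf8_encoded = quote(text, encoding="utf-8")

# Latin-1 represents it as a single octet.
latin1_encoded = quote(text, encoding="latin-1")

# Decoding with the matching charset recovers the original character.
roundtrip = unquote(latin1_encoded, encoding="latin-1")

# Decoding Latin-1 octets as UTF-8 (Python's default here) fails;
# unquote() substitutes U+FFFD for the invalid byte by default.
mismatch = unquote(latin1_encoded)

print(utf8_encoded, latin1_encoded, roundtrip, repr(mismatch))
```

Since the URL carries no charset label, the receiver can only guess or rely on convention, which is the practical reason UTF-8 has become the usual choice.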