Discrepancies of Percent Encoding for URLs

2019-05-30 07:21发布

问题:

After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encodings are correct - the Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-urlencoded content type.

This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.


Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX string. Correct me if I am wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of different parts of a URL.

回答1:

This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded.

Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.

The use of '+' for encoding space characters is specific to the application/x-www-form-urlencoded format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.

The application/x-www-form-urlencoded format is formally defined by W3C in the HTML specifications. Here is the definition from HTML 4.01:

Section 17.13.3 Processing form data, Step four: Submit the encoded form data set

This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:

If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.

• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.

Section 17.13.4 Form content types, application/x-www-form-urlencoded

This is the default content type. Forms submitted with this content type must be encoded as follows:

1.Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').

2.The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.

The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are way more refined and detailed, but for purposes of this discussion, the jist is roughly the same.

So, in the situation where the webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.

Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.

'+' is a reserved character:

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The query component explicitly allows unencoded '+' characters, as it allows characters from sub-delims:

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

pct-encoded = "%" HEXDIG HEXDIG

pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"

query       = *( pchar / "/" / "?" )

So, in the context of a webform submission, spaces are encoded using '+' prior to then being put as-is into the query component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.

So, for example: http://server/script?field=hello+world

However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, since ' ' is not included in either unreserved or sub-delims, and is not explicitly allowed by the query definition.

So, for example: http://server/script?hello%20world

Similar rules also apply to the path component, due to its use of pchar:

  path          = path-abempty    ; begins with "/" or is empty
                / path-absolute   ; begins with "/" but not "//"
                / path-noscheme   ; begins with a non-colon segment
                / path-rootless   ; begins with a segment
                / path-empty      ; zero characters

  path-abempty  = *( "/" segment )
  path-absolute = "/" [ segment-nz *( "/" segment ) ]
  path-noscheme = segment-nz-nc *( "/" segment )
  path-rootless = segment-nz *( "/" segment )
  path-empty    = 0<pchar>
  segment       = *pchar
  segment-nz    = 1*pchar
  segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                ; non-zero-length segment without any colon ":"

So, although path does allow for unencoded sub-delims characters, a '+' character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment-nz-nc.

Now, regarding the charset used to encode characters -

For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset attribute or hidden _charset_ field directly in the <form> itself, otherwise the charset is typically the charset used by the parent HTML.

However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ascii characters in a URL component (the IRI syntax, on the other hand, requires UTF-8 especially when converting an IRI into an URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not), otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding ), but those are not commonly used.