HTTP query and URI encoding doubts [closed]

2019-01-28 13:23发布

问题:

Recently I was researching HTTP query strings while wondering about possibilities on web service access interface API. And it seems very underspecified.

In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn’t say anything about format of the query string fragment and ends on defining which characters are allowed and how to encode other characters. (I will return to this later.)

The only thing I found was HTML specification on how forms are mangled into query string (HTML 4.01; 17.13.4 Form content types, application/x-www-form-urlencoded). HTML 5 algorithm seems close enough (4.10.22.5 URL-encoded form data).

This might seem OK. After all why would anyone want to set a query string format for everyone else. What for? But are there any other (than HTML) well established standards? Is anyone else using a different format?


A side question here is dealing with [] in form fields names. PHP uses that to ensure that multiple occurrences of a field are all present in $_GET superglobal variable. (Otherwise only last occurrence is present.)

But from RFC 3986 it seems that neither [ nor ] are allowed in query string. Yet my experiments with various browsers suggested that no browser encodes those characters and they are there in the URI just like that...

Is this real life practice? Or am I testing it incorrectly? I tested with PHP 5.3.17 on IIS 7. Using Internet Explorer, Firefox and Chrome. Then I compared what is in $_SERVER['QUERY_STRING'] and $_GET.


Another question is real life support for semicolon separation.

HTML 4.01 specification (B.2.2 Ampersands in URI attribute values) recommends HTTP servers to accept semicolon (;) as parameter separator (opposed to ampersand &).

Is any server supporting it? Is anyone using this? Is it worth to bother with that (when considering allowed formats of query string for a web service)?


Then how about non-ASCII characters support?

HTML 4.01 specification (B.2.1 Non-ASCII characters in URI attribute values) restates clearly what URI describing RFCs stated in the first place: non-ASCII characters are not allowed in URI. Yet specification takes into account existing practice (of use of illegal URIs) and advices to change such characters into UTF-8 encoding and then treat each byte with URI-standard hex encoding.

From my tests is seems that for example Chrome and Firefox do so. But Internet Explorer did not and just sent those characters like they were. PHP partially coped with that. $_SERVER['QUERY_STRING'] and $_GET contained those characters. But $_SERVER['REQUEST_URI'] contained ? instead.

Are there any standards or practices how to approach such cases?


And another connected question is how then should authors publish (by URI) resources with names containing non-ASCII (for example national) characters? Considering all the various parties (HTML code, browser sending request, browser saving file do disk, server receiving and processing request and server storing the file) it seems nearly impossible to have it working consistently. Or at least I never managed.

When it comes to web pages I’m already used to that and always replace national characters with corresponding Latin base characters. But when it comes to external files (PDFs, images, …) it somehow “feels wrong” to “downgrade” the names. Especially if one expects users to save those files on disk.. How to deal with this issue?

回答1:

Have you checked HTTP specyfication (RFC2616)?

Take a look at those parts:

  • http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2
  • http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2

The practical advice would be to use Base64 to encode the fields that you expect to contain risky characters and later on decode them on your backend.

Btw. Your question is really long. It decreases the chance that someone will dig into it.



回答2:

In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn’t say anything about format of the query string fragment

Yes, it does, in Section 3.4:

query       = *( pchar / "/" / "?" )

pchar is defined in Section 3.3:

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

and ends on defining which characters are allowed and how to encode other characters.

Exactly. That is defining the format of the query string fragment.

But from RFC 3986 it seems that neither [ nor ] are allowed in query string.

Officially, yes. But not all browsers do it, and that is broken behavior on their part. All official specs I have seen (and 3986 is not the only one in play) say those characters must be percent-encoded.

Then how about non-ASCII characters support?

Non-ASCII characters are not allowed in URIs. They must be charset-encoded and percent-encoded. The actual charset used is server-specific, there is no spec that allows a URI to specify the charset used. Various specs recommend UTF-8, but do not require UTF-8, and some foreign servers indeed do not use UTF-8.

The IRI spec (RFC 3987), which replaces the URL/URI specs, supports the full Unicode charset, but IRIs are still relatively new and many servers do not support them yet. However, The RFC does define algorithms for converting IRIs to URIs and vice versa.

When in doubt, percent-encode everything you are not sure about. Servers are required to support an decode them when present, before then processing the decoded data as needed.