http_build_query function's excessive urlencoding

Posted 2019-07-04 00:54

Why does the http_build_query function urlencode square brackets [] outside values when building a query string, and how do I get rid of it?

$query = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$queryString = http_build_query($query, "", "&");
var_dump($queryString);
var_dump("urldecoded: " . urldecode($queryString));

outputs:

var%5Bfoo%5D=value&var%5Bbar%5D=encodedBracket%5B
urldecoded: var[foo]=value&var[bar]=encodedBracket[

The function correctly urlencoded the [ in encodedBracket[ in the first line of the output, but what was the reason to encode the square brackets in var[foo]= and var[bar]=? As you can see, urldecoding the string also decoded reserved characters in values; encodedBracket%5B should have stayed as it was for the query string to be correct, and not become encodedBracket[.

According to section 2.2 Reserved Characters of Uniform Resource Identifier (URI): Generic Syntax

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

So shouldn't http_build_query really produce more readable output with characters like [] urlencoded only where it's required? How do I make it produce such output?

3 answers
欢心
#2 · 2019-07-04 01:31

I found the following "fix" here:

[...] the workable 'fix' I have been using was to postprocess http_build_query() output with the following - a 'solution' which makes my skin crawl just a little:

function http_build_query_unborker($s) {
    return preg_replace_callback('#%5[bd](?=[^&]*=)#i', function($match) {
        return urldecode($match[0]); 
    }, $s);
}

So now it would become:

$query = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$queryString = http_build_query_unborker(http_build_query($query, "", "&"));
var_dump($queryString); // var[foo]=value&var[bar]=encodedBracket%5B
var_dump("urldecoded: " . urldecode($queryString)); // urldecoded: var[foo]=value&var[bar]=encodedBracket[
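For comparison, here is a minimal sketch of the same idea without a regular expression: split the string into pairs and decode the brackets only in the key part. The helper name decode_key_brackets is hypothetical, and the sketch assumes "&" is the separator and that the input comes straight from http_build_query(), so any "=" inside a key or value is already percent-encoded.

```php
<?php
// Hypothetical helper (not part of PHP): decode %5B / %5D only in the key
// part of each "key=value" pair. Assumes "&" as the pair separator.
function decode_key_brackets(string $queryString): string
{
    $pairs = [];
    foreach (explode('&', $queryString) as $pair) {
        // Split on the first "=" only; everything after it is the value.
        $parts = explode('=', $pair, 2);
        $parts[0] = str_ireplace(['%5B', '%5D'], ['[', ']'], $parts[0]);
        $pairs[] = implode('=', $parts);
    }
    return implode('&', $pairs);
}

$encoded  = http_build_query(['var' => ['foo' => 'value', 'bar' => 'encodedBracket[']], '', '&');
$readable = decode_key_brackets($encoded);
echo $readable, "\n";
```

Like the regex version, this cannot tell a bracket that was literally part of a key name from one that http_build_query() added for array nesting, so it is only safe when key names themselves contain no brackets.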
戒情不戒烟
#3 · 2019-07-04 01:33

You have several questions here. I read "should" the way the RFC uses it, and I read your own questions in the same terms. I take your questions from bottom to top:

How do I make it produce such output?

By using a different encoder, Net_URL2 (pear / packagist) for example:

$vars = array("var" => array("foo" => "value", "bar" => "encodedBracket["));

$url = new Net_URL2('');
$url->setQueryVariables($vars);
$query = $url->getQuery();

var_dump($query); // string(41) "var[foo]=value&var[bar]=encodedBracket%5B"

So shouldn't http_build_query really produce more readable output with characters like [] urlencoded only where it's required?

No, it should not. Even though encoding the square brackets inside the query part is not required, it is recommended, and what is recommended should be done.

Besides, the http_build_query() function is not about creating "more readable output". It is only about creating the query of an HTTP URI. In such a query part, square brackets should be percent-encoded: they are reserved characters that are not specifically allowed in the query.

What was the reason to encode square brackets in var[foo]= and var[bar]=?

The reason to encode square brackets there is the same as the reason to encode the square bracket in encodedBracket[. The distinction you draw between these parts in your question is a syntactic one of your own; within a URI these parts are treated equally. A query part has no sub-parts in a URI, so distinguishing the bracket in var[ from the bracket in encodedBracket[ has nothing to do with URI encoding of the query part.

You say that percent-encoding encodedBracket[ to encodedBracket%5B is correct. Since both belong to the same part of the URI (the query part), logic dictates that encoding the bracket in var[ to var%5B is equally correct in terms of URI encoding. Same URI part, same encoding. The only ending delimiter the query part has is "#".

Additionally your reasoning shows a misunderstanding in this part:

As you can see, urldecoding the string also decoded reserved characters in values; encodedBracket%5B should have stayed as it was for the query string to be correct, and not become encodedBracket[.

If you urldecode, all percent-encoded sequences are decoded, regardless of whether the percent-encoding represented a reserved character or not. In terms of correctness, it is the opposite of what you stated: %5B has to be decoded to [ no matter whether it is at the beginning, in the middle or at the end of the string.
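A small sketch of why decoding the whole string is not a meaningful correctness check: once everything is decoded, a percent-encoded delimiter inside a value can no longer be told apart from a real one. The sample string is made up for illustration; urldecode() and parse_str() are the standard PHP functions.

```php
<?php
// The value of "a" is the three-character string "1&2", so its "&" is encoded.
$raw = 'a=1%262&b=3';

// Decoding the whole string first turns the encoded "&" into a delimiter.
$decodedFirst = urldecode($raw);

// Parsing the raw string splits on the real delimiters, then decodes each piece.
parse_str($raw, $parsed);

// Parsing the already-decoded string cuts the value of "a" short.
parse_str($decodedFirst, $parsedAfterDecoding);

var_dump($decodedFirst, $parsed, $parsedAfterDecoding);
```

So percent-encoding is meant to be undone per component after splitting, which is what a query parser does.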

Why does the http_build_query function urlencode square brackets [] outside values when building a query string, and how do I get rid of it?

The second part is easier: it is already answered at the beginning of this answer.

As for the why, it may not be immediately obvious, especially since you may have noticed that PHP itself accepts both percent-encoded and verbatim square brackets in the query (even intermixed) without any problems.

Where do the differences come from, and why? Is it really as simple as you outline in your question? Is it only a cosmetic difference?

First of all, not encoding square brackets in the query part of a URI violates RFC 3986, in the sense that the query part should not contain the square brackets from gen-delims unencoded. According to the ABNF, square brackets that are not percent-encoded cannot be part of the query:
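To illustrate that observation, a short sketch that feeds three spellings of the same query to parse_str(), which uses the same parsing rules PHP applies to incoming query strings. The variable names are only for illustration.

```php
<?php
$variants = [
    'var%5Bfoo%5D=value&var%5Bbar%5D=encodedBracket%5B', // brackets in keys encoded
    'var[foo]=value&var[bar]=encodedBracket%5B',         // brackets in keys verbatim
    'var%5Bfoo%5D=value&var[bar]=encodedBracket%5B',     // intermixed
];

$results = [];
foreach ($variants as $variant) {
    parse_str($variant, $parsed);
    $results[] = $parsed;
}

// All three entries hold the same nested array.
var_dump($results);
```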

 query         = *( pchar / "/" / "?" )

 pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

 unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

 pct-encoded   = "%" HEXDIG HEXDIG

 sub-delims    = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
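The grammar above can be turned into a quick check. This is only a sketch: the function name is_rfc3986_query is hypothetical, and the pattern is my own transcription of the rules quoted above into a PCRE character class.

```php
<?php
// Hypothetical helper: true when $query matches the RFC 3986 "query" rule
//   query = *( pchar / "/" / "?" )
function is_rfc3986_query(string $query): bool
{
    // unreserved, sub-delims, ":" and "@" (pchar), plus "/" and "?",
    // or a percent sign followed by exactly two hex digits.
    $pattern = '#\A(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*\z#';
    return preg_match($pattern, $query) === 1;
}

var_dump(is_rfc3986_query('var%5Bfoo%5D=value')); // encoded brackets
var_dump(is_rfc3986_query('var[foo]=value'));     // verbatim brackets
```

The first call passes and the second does not, which is the formal sense in which verbatim brackets fall outside the query grammar even though PHP parses them happily.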

Getting rid of these therefore is not suggested (at least for encoding purposes following the standard) as it will change the URI:

URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent.

This is already a good hint that the URI you are asking for has a different meaning than the URI PHP creates via the built-in function.

And further on:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.

Of the gen-delims characters, the ABNF specifically allows only these in the query:

"/" / "?" / ":" / "@"

So it looks like http_build_query() chose to percent-encode square brackets because they are reserved characters that the URI scheme does not specifically allow in that part (the query). There is basically nothing wrong with that: it follows the recommendation of RFC 3986, and it does not suggest a different meaning for those parts of the query.

However, you point out that technically these brackets are not delimiters in the query. And yes, that is true:
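For reference, a small sketch of what the built-in function emits under both of its encoding types. The sample data is made up; PHP_QUERY_RFC1738 (the default) and PHP_QUERY_RFC3986 are the constants PHP defines for the fourth parameter. As far as I can tell, neither mode leaves brackets unencoded; they differ only in how a space is written.

```php
<?php
$data = ['q' => 'a b', 'var' => ['k' => '[']];

// Default encoding type: spaces become "+".
$rfc1738 = http_build_query($data, '', '&', PHP_QUERY_RFC1738);

// RFC 3986 encoding type: spaces become "%20".
$rfc3986 = http_build_query($data, '', '&', PHP_QUERY_RFC3986);

var_dump($rfc1738, $rfc3986);
```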

The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

So, looking at the remaining reserved characters that were identified earlier as not specifically allowed:

"#" / "[" / "]"

(already a pretty small list) it should be clear that "#" must stay reserved, otherwise the URI breaks (it is a true separating delimiter at the end of the query). The square brackets, however, do not need to be specifically allowed in order to write a non-equivalent URI without data loss while preserving all URI delimiters:

If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.

So, if you are still following, one might actually want to do what you are asking for: create a URI in which the square brackets act as delimiters (for example, as part of an array definition) rather than as data, while the data octet of the character is still preserved per RFC 3986.

It is therefore technically possible to create a URI with unencoded square brackets in the query, even inside values: just as leaving them unencoded outside of values is a syntactic difference, leaving them unencoded inside values is merely another one.

This is also the reason why browsers preserve the state of square brackets within the query when you enter these into your browser. Percent-encoded or not - the browser passes that part of the URI as-is to the server so that the underlying processes on the server can benefit from syntactic differences that might have been expressed by that.

So choose the URL encoding that is correct for the underlying platform. Just because something is possible does not mean it works reliably. What http_build_query() does is the most stable (safe) way of following RFC 3986. However, it is only a "should" in the RFC, so if you understand the implications, you can have valid reasons not to percent-encode the square brackets.

One reason you name in your question is readability. That matters most when you write URLs down, for example on a sheet of paper. I'm not sure a square bracket is that easy to distinguish, or that leaving it unencoded really helps readability, but I have not tried it. PHP would accept both forms. Then again, you would not need to do that programmatically, so perhaps readability was not really the concern in your scenario.

甜甜的少女心
#4 · 2019-07-04 01:48

Here's a quick function I wrote to produce nicer query strings. It not only doesn't encode square brackets but will also omit the array key if it matches the index. Note it doesn't support objects or the additional options of http_build_query. The $prefix argument is used for recursion and should be omitted for the initial call.

function http_clean_query(array $query_data, ?string $prefix = null): string {
    $parts = [];
    $i = 0;
    foreach ($query_data as $key=>$value) {
        if ($prefix === null) {
            $key = rawurlencode($key);
        } else if ($key === $i) {
            $key = $prefix.'[]';
            $i++;
        } else {
            $key = $prefix.'['.rawurlencode($key).']';
        }
        if (is_array($value)) {
            if (!empty($value)) $parts[] = http_clean_query($value, $key);
        } else {
            $parts[] = $key.'='.rawurlencode($value);
        }
    }
    return implode('&', $parts);
}
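A quick way to sanity-check this style of output is to feed it back through parse_str(). The sketch below uses hand-written strings in the format the function is described as producing, so it does not depend on the function itself; the variable names are illustrative. The second case is an assumption based on reading the code above: for keys that are numeric but out of order, such as [1 => 'a', 0 => 'b'], the function would appear to emit list[1]=a&list[]=b, which does not decode back to the same array.

```php
<?php
$original = [
    'var'  => ['foo' => 'value', 'bar' => 'encodedBracket['],
    'list' => ['a', 'b'],
];

// Verbatim brackets in keys, indices omitted where they match the position.
$clean = 'var[foo]=value&var[bar]=encodedBracket%5B&list[]=a&list[]=b';
parse_str($clean, $decoded);
var_dump($decoded === $original);

// Out-of-order numeric keys: "[]" appends after the highest index seen so far.
$outOfOrder = 'list[1]=a&list[]=b';
parse_str($outOfOrder, $reordered);
var_dump($reordered);
```

If round-tripping matters for inputs like the second one, keeping explicit indices for every numeric key is the safer choice.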