Why does the http_build_query function URL-encode square brackets [] outside of values when building a query string, and how do I get rid of that?
$query = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$queryString = http_build_query($query, "", "&");
var_dump($queryString);
var_dump("urldecoded: " . urldecode($queryString));
outputs:
var%5Bfoo%5D=value&var%5Bbar%5D=encodedBracket%5B
urldecoded: var[foo]=value&var[bar]=encodedBracket[
The function correctly URL-encoded the [ in encodedBracket[ in the first line of the output, but what was the reason to encode the square brackets in var[foo]= and var[bar]=? As you can see, urldecoding the string also decoded reserved characters in values; encodedBracket%5B should have stayed as it was for the query string to be correct, and not become encodedBracket[.
According to section 2.2 Reserved Characters of Uniform Resource Identifier (URI): Generic Syntax
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
So shouldn't http_build_query really produce more readable output, with characters like [] URL-encoded only where it's required? How do I make it produce such output?
I found the following "fix" here:
So now it would become:
You've got many questions here. I'll speak in the RFC's sense of "should" and read your own questions in those same terms, taking them from bottom to top:
By using a different encoder, Net_URL2 (pear / packagist) for example:
No, it should not. Even though it is not required to encode the square brackets inside the query part, it is recommended, and what is recommended should be done.
Next to that, the http_build_query() function is not about creating "more readable output". It is only about creating the query of an HTTP URI. For such a query part, square brackets should be percent-encoded. These are reserved characters not specifically allowed for the query.
The reason to encode square brackets there is the same as the reason to encode the square bracket in encodedBracket[. The distinction you draw between these parts in your question is purely syntactic and your own; within a URI these parts are treated equally. There are no sub-parts of the query part in a URI. So making a distinction between the bracket in var[ and the bracket in encodedBracket[ is unrelated to URI encoding of the query part.
As you say that the percent-encoding of encodedBracket[ to encodedBracket%5B is correct, and as it belongs to the same part of the URI (the query part), logic dictates that you must accept that encoding the bracket in var[ to var%5B is equally correct in terms of URI encoding. Same URI part, same encoding. The only ending delimiter the query part has is "#".
Additionally, your reasoning shows a misunderstanding in this part:
If you urldecode, all percent-encoded sequences will be decoded, regardless of whether the percent-encoding represented a reserved character or not. In terms of correctness, it's the opposite of what you stated: %5B has to be decoded to [ regardless of whether it was at the beginning, in the middle or at the end of the string.
It's easier to answer the second part: see the beginning of the answer, where it's already answered.
As for the why: this might not be immediately visible, especially as you might have found out that PHP itself accepts both percent-encoded and verbatim square brackets in the query (even intermixed) without any problems.
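Both observations can be checked with a few lines of PHP. This is only a minimal sketch; the three query strings are made-up variants of the one from the question, and it relies on nothing but urldecode() and parse_str():

```php
<?php
// Three spellings of the same query: fully encoded brackets, literal
// brackets in the keys, and a mix of both (illustrative inputs only).
$encoded = 'var%5Bfoo%5D=value&var%5Bbar%5D=encodedBracket%5B';
$literal = 'var[foo]=value&var[bar]=encodedBracket%5B';
$mixed   = 'var%5Bfoo]=value&var[bar%5D=encodedBracket%5B';

// urldecode() decodes every percent-encoded sequence, wherever it sits.
$decoded = urldecode($encoded);

// parse_str() decodes the names first and only then looks at the brackets.
parse_str($encoded, $fromEncoded);
parse_str($literal, $fromLiteral);
parse_str($mixed, $fromMixed);

var_dump($decoded);
var_dump($fromEncoded === $fromLiteral && $fromLiteral === $fromMixed);
var_dump($fromEncoded);
```

All three spellings should parse to the same nested array, which is why the difference is easy to overlook from inside PHP.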
Where do the differences come from, and why is that so? Is it really as simple as you outline it in your question? Is this a cosmetic difference only?
First of all, not encoding square brackets in the query part of a URI violates RFC 3986 in the sense that the query part should not contain the brackets from the gen-delims characters unencoded. Non-percent-encoded square brackets cannot be part of the query according to the ABNF:
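To make that rule concrete, here is a rough sketch of a validator for the query production. The grammar rules in the comment are the ones from RFC 3986; the helper name is_valid_rfc3986_query() is hypothetical and not part of PHP:

```php
<?php
// RFC 3986 grammar used below:
//   query      = *( pchar / "/" / "?" )
//   pchar      = unreserved / pct-encoded / sub-delims / ":" / "@"
//   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
//   sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

// Hypothetical helper: true if $query matches the query production as a whole.
function is_valid_rfc3986_query(string $query): bool
{
    // Either one allowed literal character or one %XX sequence, repeated.
    $pattern = '#^(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*$#D';
    return preg_match($pattern, $query) === 1;
}

var_dump(is_valid_rfc3986_query('var%5Bfoo%5D=value')); // encoded brackets
var_dump(is_valid_rfc3986_query('var[foo]=value'));     // literal brackets
```

Under this grammar only the percent-encoded form passes; "[" and "]" are gen-delims that the query production does not list.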
Getting rid of these is therefore not suggested (at least for encoding purposes following the standard), as it will change the URI:
This is already a good hint that the URI you ask for has a different meaning than the URI PHP creates via the built-in function.
And further on:
This is not the case for all characters in gen-delims but per the ABNF:
So it looks like http_build_query() went the route of percent-encoding square brackets, as those are reserved characters and not specifically allowed by the URI scheme for that part (the query). There is basically nothing wrong with that; it follows the recommendation of RFC 3986. And it does not suggest a different meaning for those parts of the query.
However, you clearly say that technically these brackets aren't delimiters in the query. And yes, that is true:
So comparing to what has been identified earlier as reserved characters not specifically allowed:
(already a pretty small list) it should be clear that "#" must stay reserved, otherwise the URI gets broken (it is a true, separating delimiter at the end of the query), but the square brackets must not be specifically allowed when representing an unequal URI without data loss and preserving all URI delimiters:
So if you can still follow me, one might actually want to do what you're asking for: creating a URI in which the square brackets carry meaning as a delimiter (e.g. representing a fraction of an array definition) rather than as data, while the data of the character is still preserved per RFC 3986.
It is therefore technically possible to create a URI with the square brackets not percent-encoded within the query. Technically this holds even inside values: just as it is a syntactic difference outside of values, it is only another syntactic difference inside of values.
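The different roles of "#" and the brackets can be seen with parse_url(). This is a small sketch with a made-up example.com URI: the literal brackets survive inside the query, while a "#" that is meant as data has to be written as %23 or it would end the query.

```php
<?php
// Illustrative URI: literal brackets in the query, an encoded "#" inside
// a value, and a real "#" that starts the fragment.
$uri = 'http://example.com/index.php?var[foo]=a%23b&var[bar]=c#section';

$query    = parse_url($uri, PHP_URL_QUERY);
$fragment = parse_url($uri, PHP_URL_FRAGMENT);

// Decode the query into an array; %23 becomes "#" inside the value.
parse_str($query, $params);

var_dump($query, $fragment, $params);
```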
This is also the reason why browsers preserve the state of square brackets within the query when you enter these into your browser. Percent-encoded or not - the browser passes that part of the URI as-is to the server so that the underlying processes on the server can benefit from syntactic differences that might have been expressed by that.
So choose the URL encoding correctly for the underlying platform. Just because it's possible does not mean it works in a stable manner. The way http_build_query() does it is the most stable (safe) way, following RFC 3986. However, it's a "should" in the RFC, so if you understand this point, you can have valid reasons not to percent-encode the square brackets.
One reason you name in your question is readability. This is especially important when you write down URLs, for example on a sheet of paper. I'm not so sure whether a square bracket is such an easily distinguishable character, or whether leaving it unencoded even helps with readability. But I have not tried it. PHP would accept both ways. But then you wouldn't need to do that programmatically. So perhaps readability wasn't really the concern in your scenario.
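If you do decide to leave the brackets unencoded, one way to do it is to post-process the output of http_build_query(). This is a minimal sketch and not an official option of the function; the name build_query_readable_keys() is hypothetical. It restores the brackets only in the key part of each pair and leaves values encoded:

```php
<?php
// Hypothetical helper: build the query as usual, then turn %5B / %5D back
// into literal brackets in the key part of each pair only.
function build_query_readable_keys(array $data): string
{
    $query = http_build_query($data, '', '&');

    $pairs = [];
    foreach (explode('&', $query) as $pair) {
        // Split at the first "=" only; a "=" inside a value is already %3D.
        $parts = explode('=', $pair, 2);
        $parts[0] = str_replace(['%5B', '%5D'], ['[', ']'], $parts[0]);
        $pairs[] = implode('=', $parts);
    }

    return implode('&', $pairs);
}

$query = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
var_dump(build_query_readable_keys($query));
```

Note that a bracket that was part of a key name itself (rather than array nesting) is also restored by this sketch, so it is only suitable when keys don't contain brackets as data.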
Here's a quick function I wrote to produce nicer query strings. It not only doesn't encode square brackets but will also omit the array key if it matches the index. Note that it doesn't support objects or the additional options of http_build_query. The $prefix argument is used for recursion and should be omitted for the initial call.
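The function itself is not preserved here. The following is a rough sketch of what a function matching that description could look like, not the original code. The name build_nice_query() is hypothetical. As assumptions of this sketch: values are still passed through urlencode(), null values are skipped, and the index is only omitted for scalar entries of a plain 0..n list so that the result still parses back to the same array.

```php
<?php
// Hypothetical sketch: brackets from array nesting stay literal and
// list indexes are written as "[]". Objects are not supported.
function build_nice_query(array $data, string $prefix = ''): string
{
    // Drop nulls first, as http_build_query() does.
    $data = array_filter($data, function ($value) {
        return $value !== null;
    });

    // A list has exactly the keys 0, 1, 2, ... so "[]" reproduces them.
    $isList = $data !== [] && array_keys($data) === range(0, count($data) - 1);

    $pairs = [];
    foreach ($data as $key => $value) {
        if ($prefix === '') {
            $name = urlencode((string) $key);
        } elseif ($isList && !is_array($value)) {
            $name = $prefix . '[]';   // index matches position: omit it
        } else {
            $name = $prefix . '[' . urlencode((string) $key) . ']';
        }

        if (is_array($value)) {
            $nested = build_nice_query($value, $name);
            if ($nested !== '') {
                $pairs[] = $nested;
            }
        } else {
            if (is_bool($value)) {
                $value = (int) $value; // true/false become 1/0
            }
            $pairs[] = $name . '=' . urlencode((string) $value);
        }
    }

    return implode('&', $pairs);
}

$input = [
    'var'  => ['foo' => 'value', 'bar' => 'encodedBracket['],
    'list' => ['a', 'b'],
];
$nice = build_nice_query($input);
parse_str($nice, $roundTrip);
var_dump($nice, $roundTrip === $input);
```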