URL to URI encoding changes a “%3D” to “%253D”

Posted 2019-02-12 06:34

I'm having trouble encoding a URL to a URI:

mUrl = "A string url that needs to be encoded for use in a new HttpGet()";
URL url = new URL(mUrl);
URI uri = new URI(url.getProtocol(), url.getAuthority(), url.getPath(), 
    url.getQuery(), null);

This does not do what I expect for the following URL:

Passing in the String:

http://m.bloomingdales.com/img?url=http%3A%2F%2Fimages.bloomingdales.com%2Fis%2Fimage%2FBLM%2Fproducts%2F3%2Foptimized%2F1140443_fpx.tif%3Fwid%3D52%26qlt%3D90%2C0%26layer%3Dcomp%26op_sharpen%3D0%26resMode%3Dsharp2%26op_usm%3D0.7%2C1.0%2C0.5%2C0%26fmt%3Djpeg&ttl=30d

Comes out as:

http://m.bloomingdales.com/img?url=http%253A%252F%252Fimages.bloomingdales.com%252Fis%252Fimage%252FBLM%252Fproducts%252F3%252Foptimized%252F1140443_fpx.tif%253Fwid%253D52%2526qlt%253D90%252C0%2526layer%253Dcomp%2526op_sharpen%253D0%2526resMode%253Dsharp2%2526op_usm%253D0.7%252C1.0%252C0.5%252C0%2526fmt%253Djpeg&ttl=30d

This is broken: for example, %3D is turned into %253D. It seems to be doing something mysterious to the % signs already in the string.

What's going on and what am I doing wrong here?

4 answers
在下西门庆
Reply #2 · 2019-02-12 07:06

%3D is the percent-encoded form of = (equals sign): decimal 61, hex byte 3D.

%253D is %3D with its % sign encoded again as %25, so decoding it once yields the text %3D rather than =.
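
To make the two layers visible, here is a minimal sketch that decodes each string with java.net.URLDecoder, assuming UTF-8. The class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class PercentDecodeDemo {
    // Decode exactly one layer of percent-encoding (UTF-8 assumed).
    static String decodeOnce(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeOnce("%3D"));               // =
        System.out.println(decodeOnce("%253D"));             // %3D
        System.out.println(decodeOnce(decodeOnce("%253D"))); // =
    }
}
```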

姐就是有狂的资本
Reply #3 · 2019-02-12 07:17

You are first putting the (already-escaped) string into the URL class. That doesn't escape anything. Then you are pulling out sections of the URL, which returns them without any further processing (so -- they are still escaped since they were escaped when you put them in). Finally, you are putting the sections into the URI class, using the multi-argument constructor. This constructor is specified as percent-encoding the URI components, and it always encodes a "%" sign.

Therefore, it is in this final step that, for example, a space becomes "%20" (good) and "%3A" becomes "%253A" (bad). Since you are putting in URLs which are already encoded*, you don't want to encode them again.
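
The double-encoding can be reproduced in isolation. The sketch below feeds an already-escaped query to the five-argument URI constructor; the host example.com and the class and method names are placeholders, not part of the original code.

```java
import java.net.URI;
import java.net.URISyntaxException;

class DoubleEncodingDemo {
    // Build a URI from components. This constructor quotes illegal
    // characters and always quotes '%', even in an existing escape.
    static String viaComponents(String authority, String path, String query) {
        try {
            return new URI("http", authority, path, query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The query is already percent-encoded, so every '%' becomes "%25".
        System.out.println(viaComponents("example.com", "/img", "url=http%3A%2F%2Fa.example%2Fx"));
        // http://example.com/img?url=http%253A%252F%252Fa.example%252Fx
    }
}
```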

Therefore, the single-argument constructor of URI is your friend. It doesn't escape anything, and requires that you pass a pre-escaped string. Hence, you don't need URL at all:

mUrl = "A string url is already percent-encoded for use in a new HttpGet()";
URI uri = new URI(mUrl);
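
As a quick check of that claim, the following sketch parses a pre-escaped string and prints it back. It uses URI.create, which wraps the single-argument constructor and throws an unchecked exception instead; the URL and the class name are illustrative only.

```java
import java.net.URI;

class SingleArgUriDemo {
    // Parse a string that is assumed to be percent-encoded already.
    static URI parse(String escaped) {
        return URI.create(escaped);
    }

    public static void main(String[] args) {
        String escaped = "http://example.com/img?url=http%3A%2F%2Fa.example%2Fx.tif%3Fwid%3D52&ttl=30d";
        URI uri = parse(escaped);
        System.out.println(uri.toString().equals(escaped)); // true: nothing was re-encoded
        System.out.println(uri.getRawQuery()); // url=http%3A%2F%2Fa.example%2Fx.tif%3Fwid%3D52&ttl=30d
        System.out.println(uri.getQuery());    // url=http://a.example/x.tif?wid=52&ttl=30d
    }
}
```

The getRaw* accessors return the component as written, while the plain getters decode it.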

*The only problem is if your URLs are sometimes not percent-encoded, and sometimes they are. Then you have a bigger problem. You need to decide whether your program is starting out with a URL which is always encoded, or one which needs to be encoded.

Note that there is no such thing as a full URL which is not percent-encoded. For example, you can't take the full URL "http://example.com/bob&co" and somehow turn it into the properly-encoded URL "http://example.com/bob%26co" -- how can you tell the difference between the syntax (which shouldn't be escaped) and the characters (which should)? This is why the single-argument form of URI requires that strings are already-escaped. If you have unescaped strings, you need to percent-encode them before inserting them into the full URL syntax, and that is what the multi-argument constructor of URI helps you do.
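
For the opposite case, where the parts really are unescaped, a rough sketch of the multi-argument constructor doing the encoding might look like this. The host, paths, and names are invented. Note that the constructor only quotes characters that are illegal in the given component, so "&" in a path is left alone.

```java
import java.net.URI;
import java.net.URISyntaxException;

class RawComponentsDemo {
    // Assemble a URI from raw (unescaped) parts and let URI do the quoting.
    static String build(String host, String rawPath, String rawQuery) {
        try {
            return new URI("http", host, rawPath, rawQuery, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A space and a literal '%' are illegal as-is, so both are quoted.
        System.out.println(build("example.com", "/sale/50% off", "q=hello world"));
        // http://example.com/sale/50%25%20off?q=hello%20world

        // '&' is legal in a path, so it is kept unchanged.
        System.out.println(build("example.com", "/bob&co", null));
        // http://example.com/bob&co
    }
}
```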

Edit: I missed the fact that the original code discards the fragment. If you want to remove the fragment (or any other part) of the URL, you can construct the URI as above, then pull all the parts out as required (they will be decoded into regular strings), then pass them back into the URI multi-argument constructor (where they will be re-encoded as URI components):

uri = new URI(uri.getScheme(), uri.getUserInfo(), uri.getHost(), uri.getPort(),
              uri.getPath(), uri.getQuery(), null);  // Remove fragment
来,给爷笑一个
Reply #4 · 2019-02-12 07:19

The URL class didn't decode the %-sequences when it parsed the URL, but the URI class is encoding them (again). Use URI to parse the URL string.

Javadocs:

http://download.oracle.com/javase/6/docs/api/java/net/URL.html

The URL class does not itself encode or decode any URL components according to the escaping mechanism defined in RFC2396. It is the responsibility of the caller to encode any fields, which need to be escaped prior to calling URL, and also to decode any escaped fields, that are returned from URL. Furthermore, because URL has no knowledge of URL escaping, it does not recognise equivalence between the encoded or decoded form of the same URL. For example, the two URLs:

http://foo.com/hello world/ and http://foo.com/hello%20world

would be considered not equal to each other. Note, the URI class does perform escaping of its component fields in certain circumstances.

The recommended way to manage the encoding and decoding of URLs is to use URI, and to convert between these two classes using toURI() and URI.toURL().
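
A small sketch of that conversion, using the Javadoc's own foo.com example (the class and helper names are made up): URL hands back the path exactly as written, while URI can return it either raw or decoded.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class UriUrlConversionDemo {
    // Parse with URI first, then convert to URL.
    static URL toUrl(String escaped) {
        try {
            return URI.create(escaped).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Convert back; this fails if the URL is not a valid, escaped URI.
    static URI toUri(URL url) {
        try {
            return url.toURI();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL url = toUrl("http://foo.com/hello%20world");
        System.out.println(url.getPath());             // /hello%20world (URL never decodes)
        System.out.println(toUri(url).getRawPath());   // /hello%20world
        System.out.println(toUri(url).getPath());      // /hello world
    }
}
```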

一纸荒年 Trace。
Reply #5 · 2019-02-12 07:23

What is happening here is that the % signs from the first URL are being escaped, meaning they are turned into %25 in the output. You need to put precautions in place so that your code escapes only the characters that need it, and never escapes sequences that are already escaped.

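These are some characters that may need escaping, depending on which part of the URL they appear in:

<
>
"
!
#
$
'
(
)
*
,
/
:
;
@
[
\
]
^
`
{
|
}

Alphanumeric characters, along with -, ., _, and ~, never need escaping. Characters such as = and & are left alone when they act as query delimiters, and a % is left alone only when it already starts an escape sequence.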
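
One way to keep this straight is to encode a raw value exactly once, at the point where it is inserted into the query, and never again afterwards. The sketch below uses java.net.URLEncoder with UTF-8. It applies form encoding (a space becomes +), and the inner URL and the names are invented for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EncodeOnceDemo {
    // Encode a raw query value (UTF-8 assumed). Call this once per value.
    static String encodeValue(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String raw = "http://images.example.com/a.tif?wid=52&fmt=jpeg";
        String once = encodeValue(raw);
        System.out.println("http://example.com/img?url=" + once + "&ttl=30d");
        // http://example.com/img?url=http%3A%2F%2Fimages.example.com%2Fa.tif%3Fwid%3D52%26fmt%3Djpeg&ttl=30d

        // Encoding an already-encoded value reproduces the bug from the question.
        System.out.println(encodeValue(encodeValue("a=b"))); // a%253Db
    }
}
```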
