I'm having trouble encoding a URL to a URI:
mUrl = "A string url that needs to be encoded for use in a new HttpGet()";
URL url = new URL(mUrl);
URI uri = new URI(url.getProtocol(), url.getAuthority(), url.getPath(),
url.getQuery(), null);
This does not do what I expect for the following URL:
Passing in the String:
http://m.bloomingdales.com/img?url=http%3A%2F%2Fimages.bloomingdales.com%2Fis%2Fimage%2FBLM%2Fproducts%2F3%2Foptimized%2F1140443_fpx.tif%3Fwid%3D52%26qlt%3D90%2C0%26layer%3Dcomp%26op_sharpen%3D0%26resMode%3Dsharp2%26op_usm%3D0.7%2C1.0%2C0.5%2C0%26fmt%3Djpeg&ttl=30d
Comes out as:
http://m.bloomingdales.com/img?url=http%253A%252F%252Fimages.bloomingdales.com%252Fis%252Fimage%252FBLM%252Fproducts%252F3%252Foptimized%252F1140443_fpx.tif%253Fwid%253D52%2526qlt%253D90%252C0%2526layer%253Dcomp%2526op_sharpen%253D0%2526resMode%253Dsharp2%2526op_usm%253D0.7%252C1.0%252C0.5%252C0%2526fmt%253Djpeg&ttl=30d
Which is broken. For example, the %3D is turned into %253D. It seems to be doing something mysterious to the %'s already in the string.
What's going on and what am I doing wrong here?
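%3D is the percent-encoding of = (the equals sign: decimal 61, hex byte 3D).
%253D is the percent-encoding of the literal text %3D: the % itself has been encoded as %25, so the sequence has been escaped twice.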
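To make the two layers visible, here is a minimal sketch that peels them off one at a time. The class and method names are made up for illustration, and it assumes Java 10 or later for the Charset overload of URLDecoder.decode.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question.
class PercentLayers {
    // Removes exactly one layer of percent-encoding.
    // Assumes Java 10+ (Charset overload of URLDecoder.decode).
    static String decodeOnce(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String doubled = "%253D";
        String once = decodeOnce(doubled);  // %25 decodes to a literal '%'
        String twice = decodeOnce(once);    // %3D decodes to '='
        System.out.println(doubled + " -> " + once + " -> " + twice);
    }
}
```

If decoding once still leaves %XX sequences behind, that is usually a sign the string was encoded twice somewhere upstream.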
You are first putting the (already-escaped) string into the URL class. That doesn't escape anything. Then you are pulling out sections of the URL, which returns them without any further processing (so they are still escaped, since they were escaped when you put them in). Finally, you are putting the sections into the URI class, using the multi-argument constructor. This constructor is specified as percent-encoding the URI components.
Therefore, it is in this final step that, for example, " " becomes "%20" (good) and "%20" becomes "%2520" (bad). Since you are putting in URLs which are already encoded*, you don't want to encode them again.
Therefore, the single-argument constructor of URI is your friend. It doesn't escape anything, and requires that you pass a pre-escaped string. Hence, you don't need URL at all: pass the string directly, as in new URI(mUrl).
*The only problem is if your URLs are sometimes not percent-encoded, and sometimes they are. Then you have a bigger problem. You need to decide whether your program is starting out with a URL which is always encoded, or one which needs to be encoded.
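A small sketch of the difference between the two constructors, assuming a shortened stand-in URL (example.com and example.org are placeholders, not the URL from the question) and hypothetical class and method names:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; names are illustrative only.
class UriConstructors {
    // Single-argument form: parses an already-encoded string and escapes nothing.
    static URI parseEncoded(String encodedUrl) {
        try {
            return new URI(encodedUrl);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Multi-argument form: percent-encodes each component, and always quotes '%'.
    static URI buildFromParts(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String query = "url=http%3A%2F%2Fexample.org%2Fa.tif&ttl=30d";
        String encoded = "http://example.com/img?" + query;

        // Comes back exactly as it went in.
        System.out.println(parseEncoded(encoded));

        // The existing %3A and %2F are escaped again, to %253A and %252F.
        System.out.println(buildFromParts("http", "example.com", "/img", query));
    }
}
```

The second call reproduces the double encoding from the question, because the query handed to the multi-argument constructor was already escaped.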
Note that there is no such thing as a full URL which is not percent-encoded. For example, you can't take the full URL "http://example.com/bob&co" and somehow turn it into the properly encoded URL "http://example.com/bob%26co": how can you tell the difference between the syntax (which shouldn't be escaped) and the characters (which should)? This is why the single-argument form of URI requires that strings are already escaped. If you have unescaped strings, you need to percent-encode them before inserting them into the full URL syntax, and that is what the multi-argument constructor of URI helps you do.
Edit: I missed the fact that the original code discards the fragment. If you want to remove the fragment (or any other part) of the URL, you can construct the URI with the single-argument constructor, then pull all the parts out as required (they will be decoded into regular strings), then pass them back into the URI multi-argument constructor (where they will be re-encoded as URI components), for example new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(), uri.getQuery(), null).
The URL class didn't decode the %-sequences when it parsed the URL, but the URI class is encoding them (again). Use URI to parse the URL string.
Javadocs: http://download.oracle.com/javase/6/docs/api/java/net/URL.html
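Here is a rough sketch of dropping the fragment after parsing with URI, using a placeholder URL and made-up method names. One caveat worth checking against your own data: the decoded getters turn %3A and %2F back into : and /, and since those characters are legal in a query the multi-argument constructor does not escape them again, so the round trip avoids double encoding but may not reproduce the original escapes. The second method is an alternative that reuses the raw, still-encoded parts; it assumes an absolute URL with an authority, such as an http URL.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class; names are illustrative only.
class FragmentStripper {
    // Parse, read the decoded parts, and let the multi-argument
    // constructor encode them once. Passing null drops the fragment.
    static URI rebuildFromDecodedParts(String encodedUrl) {
        try {
            URI parsed = new URI(encodedUrl);
            return new URI(parsed.getScheme(), parsed.getAuthority(),
                           parsed.getPath(), parsed.getQuery(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Alternative: stitch the raw parts back together so nothing is
    // decoded or re-encoded. Assumes scheme and authority are present.
    static URI rebuildFromRawParts(String encodedUrl) {
        try {
            URI parsed = new URI(encodedUrl);
            StringBuilder sb = new StringBuilder();
            sb.append(parsed.getScheme()).append("://")
              .append(parsed.getRawAuthority())
              .append(parsed.getRawPath());
            if (parsed.getRawQuery() != null) {
                sb.append('?').append(parsed.getRawQuery());
            }
            return new URI(sb.toString());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://example.com/img?url=http%3A%2F%2Fexample.org%2Fa.tif&ttl=30d#top";
        System.out.println(rebuildFromDecodedParts(url));
        System.out.println(rebuildFromRawParts(url));
    }
}
```

If the query carries a nested, escaped URL as in the question, the raw variant is probably the safer of the two, because it leaves the existing escapes untouched.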
What is happening here is that the % signs from the first URL are being escaped, meaning they are turned into %25 in the output. You need to put precautions in place so that your script escapes only the characters that need it, and not sequences that are already escaped.
These are some characters that NEED escaping:
The rest, like =, % (when it starts an existing escape sequence), and &, and alphanumeric characters, do not.
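As a rough illustration of "escape only what is not already escaped", here is a heuristic sketch. The pass-through set is an assumption taken from the RFC 3986 reserved and unreserved characters, not from the list in this answer, and the class and method names are made up. Note that it cannot tell a real escape from literal text that happens to look like %3D, which is exactly the ambiguity described in the other answer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; a heuristic, not a general-purpose encoder.
class EscapeOnce {
    // Assumption: RFC 3986 unreserved and reserved characters pass through.
    private static final String PASS_THROUGH =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        + "-._~" + ":/?#[]@!$&'()*+,;=";

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    static String escapeUnlessEscaped(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()
                    && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2))) {
                // Looks like an existing %XX escape: copy it unchanged.
                out.append(s, i, i + 3);
                i += 3;
            } else if (PASS_THROUGH.indexOf(c) >= 0) {
                out.append(c);
                i++;
            } else {
                // Everything else (including a lone '%') is encoded as UTF-8 bytes.
                int cp = s.codePointAt(i);
                String ch = new String(Character.toChars(cp));
                for (byte b : ch.getBytes(StandardCharsets.UTF_8)) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
                i += Character.charCount(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeUnlessEscaped("http://example.com/img?url=http%3A%2F%2Fexample.org&note=a b"));
        System.out.println(escapeUnlessEscaped("100%"));
    }
}
```

If you know up front whether your input is encoded or not, it is probably cleaner to pick the matching URI constructor than to rely on a guess like this.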