How to encode a URL to be “browserable”?

Posted 2020-04-11 13:43

I want to know if there is any way to convert a URL like this:

https://www.mysite.com/lot/of/unpleasant/folders/and/my/url with spaces &"others".xls

into

https://www.mysite.com/lot/of/unpleasant/folders/and/my/url%20with%20spaces%20&%22others%22.xls

This is similar to the rewriting Firefox does when you paste the former URL into the address bar: it sends the request to the server (which returns nothing unless you actually have a site like this), and the URL you can then copy from the navigation bar and paste somewhere else is the encoded one.

Using URLEncoder#encode gives me this (undesired) output:

https%3A%2F%2Fwww.mysite.com%2Flot%2Fof%2Funpleasant%2Ffolders%2Fand%2Fmy%2Furl+with+spaces+%26%22others%22.xls

Sadly, I receive the whole URL as a single String, as shown at the beginning of the question, so using URLEncoder#encode directly doesn't work.
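
For reference, here is a minimal reproduction of that call. The class and method names (UrlEncoderDemo, encodeWhole) are placeholders made up for illustration. URLEncoder is meant for HTML form values, so when it is handed the complete URL it also escapes the ':' and '/' separators and turns spaces into '+':

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncoderDemo {
    // Applies URLEncoder to the complete URL, which is what over-encodes it.
    static String encodeWhole(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String evilUrl = "https://www.mysite.com/lot/of/unpleasant/folders/and/my/url with spaces &\"others\".xls";
        System.out.println(encodeWhole(evilUrl));
    }
}
```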

I naively tried this:

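// Requires: import java.net.URI; the URI constructor throws URISyntaxException.
String evilUrl = "https://www.mysite.com/lot/of/unpleasant/folders/and/my/url with spaces &\"others\".xls";
// Split "scheme://rest" into the scheme and everything after it.
String[] urlParts = evilUrl.split("://");
String scheme = urlParts[0];
// The first segment is the host; the remaining segments form the path.
urlParts = urlParts[1].split("/");
String host = urlParts[0];
// Note: new StringBuilder('/') would treat '/' as an int capacity, not as initial content.
StringBuilder sb = new StringBuilder();
for (int i = 1; i < urlParts.length; i++) {
    sb.append('/');
    sb.append(urlParts[i]);
}
// The multi-argument constructor quotes characters that are illegal in the path.
URI uri = new URI(scheme, host, sb.toString(), null);
System.out.println(uri.toASCIIString());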

It gives this (better) output:

https://www.mysite.com/lot/of/unpleasant/folders/and/my/url%20with%20spaces%20&%22others%22.xls

But I'm not sure whether there is an out-of-the-box solution for this problem and I'm racking my brain for nothing, or whether I can rely on this piece of code to solve my problem in most cases.


By the way, I already visited some resources on this topic:

Tags: java url
1 answer
一纸荒年 Trace。
#2 · 2020-04-11 13:59

The problem with this sort of URL is that it is partially encoded. An out-of-the-box encoder will always encode the whole string, so I think your approach of using a custom encoder is correct. Your code is OK; you would just need to add some validation, for instance for the case where the "evil URL" comes without the protocol part (i.e. without the "https://"), unless you're sure that will never happen.
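
The validation mentioned above could look roughly like this. It is only a sketch: the class and method names (EvilUrlEncoder, encode) are made up, and it assumes the URL has no port, user info, query string or fragment, which the snippet in the question does not handle either.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class EvilUrlEncoder {
    // Splits "scheme://host/path" by hand and lets java.net.URI quote the path.
    static String encode(String evilUrl) {
        int schemeEnd = evilUrl.indexOf("://");
        if (schemeEnd <= 0) {
            // Covers both a missing "://" and an empty scheme.
            throw new IllegalArgumentException("Missing scheme: " + evilUrl);
        }
        String scheme = evilUrl.substring(0, schemeEnd);
        String rest = evilUrl.substring(schemeEnd + 3);

        // Everything up to the first '/' is taken as the host (assumption: no port).
        int pathStart = rest.indexOf('/');
        String host = pathStart == -1 ? rest : rest.substring(0, pathStart);
        String path = pathStart == -1 ? "" : rest.substring(pathStart);
        if (host.isEmpty()) {
            throw new IllegalArgumentException("Missing host: " + evilUrl);
        }

        try {
            // The multi-argument constructor quotes illegal characters in the path.
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode: " + evilUrl, e);
        }
    }
}
```

Throwing IllegalArgumentException is just one choice; depending on where the strings come from, returning the input unchanged or prepending a default scheme may suit you better.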

I had some spare time, so I wrote an alternative custom encoder. The strategy is to look for characters that are not allowed in a URL and encode only those, rather than trying to re-encode the whole thing:

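// Requires: import java.nio.charset.StandardCharsets;
private static String encodeSemiEncoded(String semiEncodedUrl) {
    // Punctuation that may appear literally in a URL.
    // '%' is included so that existing escapes such as %20 are not encoded twice.
    final String ALLOWED_CHARS = "!*'();:@&=+$,/?#[]-_.~%";
    StringBuilder encoded = new StringBuilder();
    for (int i = 0; i < semiEncodedUrl.length(); ) {
        // Iterate by code point so that surrogate pairs stay together.
        int cp = semiEncodedUrl.codePointAt(i);
        i += Character.charCount(cp);
        boolean isAscii = cp < 128;
        if (isAscii && (Character.isLetterOrDigit(cp) || ALLOWED_CHARS.indexOf(cp) != -1)) {
            encoded.appendCodePoint(cp);
        } else {
            // Percent-encode every UTF-8 byte of the character.
            for (byte b : new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8)) {
                encoded.append(String.format("%%%02X", b & 0xFF));
            }
        }
    }
    return encoded.toString();
}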
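
One thing a character-by-character encoder like this has to decide is what to do with '%' itself. If the input is partially encoded, a '%' followed by two hex digits is probably an existing escape and should be left alone, while a stray '%' should become %25. A minimal sketch of that check follows; the names (PercentAware, isEscapeAt, encodeStrayPercents) are hypothetical, and it deals with the '%' character only:

```java
final class PercentAware {
    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // True if s has a '%' at index i followed by two hex digits (e.g. "%20").
    static boolean isEscapeAt(String s, int i) {
        return i + 2 < s.length()
                && s.charAt(i) == '%'
                && isHex(s.charAt(i + 1))
                && isHex(s.charAt(i + 2));
    }

    // Encodes only stray '%' characters; everything else is copied unchanged.
    static String encodeStrayPercents(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '%' && !isEscapeAt(s, i)) {
                out.append("%25");
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }
}
```

This is a heuristic: a file that is really named "100%20.xls" would be indistinguishable from an already encoded space, so whether to apply it depends on what you know about the source of the strings.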

Hope this helps
