HTTP URL Address Encoding in Java

2018-12-31 03:55发布

My Java standalone application gets a URL (which points to a file) from the user and I need to hit it and download it. The problem I am facing is that I am not able to encode the HTTP URL address properly...

Example:

URL:  http://search.barnesandnoble.com/booksearch/first book.pdf

java.net.URLEncoder.encode(url.toString(), "ISO-8859-1");

returns me:

http%3A%2F%2Fsearch.barnesandnoble.com%2Fbooksearch%2Ffirst+book.pdf

But, what I want is

http://search.barnesandnoble.com/booksearch/first%20book.pdf

(space replaced by %20)

I guess URLEncoder is not designed to encode HTTP URLs... The JavaDoc says "Utility class for HTML form encoding"... Is there any other way to do this?

24条回答
何处买醉
2楼-- · 2018-12-31 04:46

Unfortunately, org.apache.commons.httpclient.util.URIUtil is deprecated, and the replacement org.apache.commons.codec.net.URLCodec does coding suitable for form posts, not in actual URL's. So I had to write my own function, which does a single component (not suitable for entire query strings that have ?'s and &'s)

public static String encodeURLComponent(final String s)
{
  if (s == null)
  {
    return "";
  }

  final StringBuilder sb = new StringBuilder();

  try
  {
    for (int i = 0; i < s.length(); i++)
    {
      final char c = s.charAt(i);

      if (((c >= 'A') && (c <= 'Z')) || ((c >= 'a') && (c <= 'z')) ||
          ((c >= '0') && (c <= '9')) ||
          (c == '-') ||  (c == '.')  || (c == '_') || (c == '~'))
      {
        sb.append(c);
      }
      else
      {
        final byte[] bytes = ("" + c).getBytes("UTF-8");

        for (byte b : bytes)
        {
          sb.append('%');

          int upper = (((int) b) >> 4) & 0xf;
          sb.append(Integer.toHexString(upper).toUpperCase(Locale.US));

          int lower = ((int) b) & 0xf;
          sb.append(Integer.toHexString(lower).toUpperCase(Locale.US));
        }
      }
    }

    return sb.toString();
  }
  catch (UnsupportedEncodingException uee)
  {
    throw new RuntimeException("UTF-8 unsupported!?", uee);
  }
}
查看更多
浅入江南
3楼-- · 2018-12-31 04:47

a solution i developed and much more stable than any other:

public class URLParamEncoder {

    public static String encode(String input) {
        StringBuilder resultStr = new StringBuilder();
        for (char ch : input.toCharArray()) {
            if (isUnsafe(ch)) {
                resultStr.append('%');
                resultStr.append(toHex(ch / 16));
                resultStr.append(toHex(ch % 16));
            } else {
                resultStr.append(ch);
            }
        }
        return resultStr.toString();
    }

    private static char toHex(int ch) {
        return (char) (ch < 10 ? '0' + ch : 'A' + ch - 10);
    }

    private static boolean isUnsafe(char ch) {
        if (ch > 128 || ch < 0)
            return true;
        return " %$&+,/:;=?@<>#%".indexOf(ch) >= 0;
    }

}
查看更多
初与友歌
4楼-- · 2018-12-31 04:49

If you have a URL, you can pass url.toString() into this method. First decode, to avoid double encoding (for example, encoding a space results in %20 and encoding a percent sign results in %25, so double encoding will turn a space into %2520). Then, use the URI as explained above, adding in all the parts of the URL (so that you don't drop the query parameters).

public URL convertToURLEscapingIllegalCharacters(String string){
    try {
        String decodedURL = URLDecoder.decode(string, "UTF-8");
        URL url = new URL(decodedURL);
        URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef()); 
        return uri.toURL(); 
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
}
查看更多
呛了眼睛熬了心
5楼-- · 2018-12-31 04:52

Nitpicking: a string containing a whitespace character by definition is not a URI. So what you're looking for is code that implements the URI escaping defined in Section 2.1 of RFC 3986.

查看更多
其实,你不懂
6楼-- · 2018-12-31 04:53

There is still a problem if you have got an encoded "/" (%2F) in your URL.

RFC 3986 - Section 2.2 says: "If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed." (RFC 3986 - Section 2.2)

But there is an Issue with Tomcat:

http://tomcat.apache.org/security-6.html - Fixed in Apache Tomcat 6.0.10

important: Directory traversal CVE-2007-0450

Tomcat permits '\', '%2F' and '%5C' [...] .

The following Java system properties have been added to Tomcat to provide additional control of the handling of path delimiters in URLs (both options default to false):

  • org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH: true|false
  • org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH: true|false

Due to the impossibility to guarantee that all URLs are handled by Tomcat as they are in proxy servers, Tomcat should always be secured as if no proxy restricting context access was used.

Affects: 6.0.0-6.0.9

So if you have got an URL with the %2F character, Tomcat returns: "400 Invalid URI: noSlash"

You can switch of the bugfix in the Tomcat startup script:

set JAVA_OPTS=%JAVA_OPTS% %LOGGING_CONFIG%   -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true 
查看更多
时光乱了年华
7楼-- · 2018-12-31 04:54

In addition to the Carlos Heuberger's reply: if a different than the default (80) is needed, the 7 param constructor should be used:

URI uri = new URI(
        "http",
        null, // this is for userInfo
        "www.google.com",
        8080, // port number as int
        "/ig/api",
        "weather=São Paulo",
        null);
String request = uri.toASCIIString();
查看更多
登录 后发表回答