Trying to read from a URL (in Java) produces gibberish

Posted 2019-03-01 14:38

Question:

I'm trying to read from a URL, and then print the result.

BufferedReader in = new BufferedReader(
     new InputStreamReader(new URL("http://somesite.com/").openStream(), "UTF-8"));
String s = "";
while ((s=in.readLine())!=null) System.out.println(s);
in.close();

It works great most of the time and prints the website's source. On specific websites, however, it prints gibberish instead of the source code: symbols and other unusual characters.

Is there some property that varies from website to website that would affect how it is read? The page loads just fine in Firefox, and I can view the source there with no problem. If Firefox can access the source, I should be able to as well; I'm just not sure why it isn't working.

EDIT: Added "UTF-8" to the InputStreamReader. All of the strange characters are now question marks, so it is still not working.

Answer 1:

After much searching I found the answer to this. The XML is read as gibberish because the response is gzip-compressed. To read it, wrap the connection's input stream in a GZIPInputStream so the bytes are decompressed before they are decoded as text.

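// Needs java.io.*, java.net.*, java.nio.charset.StandardCharsets, java.util.zip.GZIPInputStream
URL url = new URL("http://somesite.com/");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestProperty("Accept-Encoding", "gzip");
// Assumes the server really answers with gzip and that the text is UTF-8
Reader in = new InputStreamReader(
        new GZIPInputStream(connection.getInputStream()), StandardCharsets.UTF_8);
int ch;
while ((ch = in.read()) != -1) {
    System.out.print((char) ch);
}
in.close();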
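
A server is free to ignore Accept-Encoding and send an uncompressed body, and GZIPInputStream rejects input that is not gzip data. One way to cope is to look at the Content-Encoding response header before deciding whether to unwrap. The following is only a sketch under that assumption; the class and method names (GzipAwareBody, readBody, gzip) are hypothetical, and the demo compresses a string in memory so that it runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the original answer.
class GzipAwareBody {

    // Reads the whole body as text, unwrapping gzip only when the
    // Content-Encoding header value (which may be null) says "gzip".
    static String readBody(InputStream raw, String contentEncoding, Charset charset)
            throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Compresses text in memory so the demo needs no network access.
    static byte[] gzip(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(charset));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<html>hello</html>", StandardCharsets.UTF_8);
        System.out.println(readBody(
                new ByteArrayInputStream(compressed), "gzip", StandardCharsets.UTF_8));
    }
}
```

With a real connection, the call would look like readBody(connection.getInputStream(), connection.getContentEncoding(), StandardCharsets.UTF_8), where UTF-8 is still an assumption about the page.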


Answer 2:

You're probably running into a character encoding issue.

There should be an HTTP header like the following in the response:

Content-Type: text/html; charset=UTF-8
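
If that header is present, its charset parameter can be passed to the InputStreamReader instead of a hard-coded "UTF-8". Below is a rough sketch of extracting it; the class and method names (ContentTypeCharset, charsetOf) are hypothetical, and the fallback charset is whatever the caller considers a sensible default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class ContentTypeCharset {

    // Returns the charset named in a Content-Type value, or the fallback
    // when the header is missing, has no charset parameter, or names an
    // encoding this JVM does not support.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for a "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

With a URLConnection, the header value is available from connection.getContentType(). Pages can also declare their encoding in a meta tag inside the HTML, which this sketch does not look at.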


Answer 3:

Try using telnet to diagnose what's coming over the wire. It may not be textual data. For example, what happens when you do this?

telnet somesite.com 80
GET / HTTP/1.0
Host: somesite.com

(press Enter twice after the last line; the empty line ends the request headers)

This should allow you to see the headers and content coming in and should give you a better clue as to what's going on.
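
If telnet is not installed, roughly the same check can be done from Java with a plain socket. This is only a sketch: the class and method names (RawHttpDump, buildRequest, dump) are hypothetical, somesite.com is the placeholder host from the question, and it speaks plain HTTP on the given port only (no HTTPS, no redirects).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class RawHttpDump {

    // Builds the same request that is typed into telnet. Each line ends in
    // CRLF, and an empty line marks the end of the headers.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n";
    }

    // Sends the request and copies the raw response bytes
    // (status line, headers, body) to standard output.
    static void dump(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                System.out.write(buf, 0, n);
            }
            System.out.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No host given: just show the request that would be sent.
            System.out.print(buildRequest("somesite.com", "/"));
        } else {
            dump(args[0], 80, "/");
        }
    }
}
```

Headers such as Content-Encoding and Content-Type in the output are the ones to look at first.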



Answer 4:

I had the same issue until I used HttpURLConnection with setChunkedStreamingMode set.

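// Needs java.io.*, java.net.*
// serverAddress is a java.net.URL, e.g. new URL("http://somesite.com/")
HttpURLConnection connection = (HttpURLConnection) serverAddress.openConnection();
connection.setRequestMethod("GET");
connection.setDoOutput(true);             // only needed when sending a request body
connection.setReadTimeout(2000);          // milliseconds
connection.setChunkedStreamingMode(0);    // 0 = use the default chunk length

connection.connect();

// No charset is given, so the platform default is used for decoding
BufferedReader rd = new BufferedReader(new InputStreamReader(connection.getInputStream()));

StringBuilder sb = new StringBuilder();
String line;

while ((line = rd.readLine()) != null)
{
    sb.append(line).append('\n');         // readLine() strips the line terminator
}
rd.close();

System.out.println(sb.toString());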