How to retrieve HTML page in proper encoding using

Posted 2019-07-19 10:05

Question:

How can I read an HTTP stream containing an HTML page using the page's own encoding?

Here is the code fragment I use to get the HTTP stream. InputStreamReader accepts an optional encoding argument, but I don't know how to obtain the encoding.

URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();
BufferedReader d = new BufferedReader(new InputStreamReader(is));

Answer 1:

Retrieving a web page is a reasonably complicated process, which is why libraries such as HttpClient exist. My advice: unless you have a really compelling reason not to, use HttpClient.



Answer 2:

When the connection is established through

URLConnection conn = url.openConnection();

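you can get the encoding name through conn.getContentEncoding() (the method is defined on URLConnection, not on URL) and pass that String to InputStreamReader, so the code looks like

BufferedReader d = new BufferedReader(new InputStreamReader(is, conn.getContentEncoding()));

Be aware that getContentEncoding() returns the value of the Content-Encoding header, which typically names a compression scheme such as gzip or is null, not a character set. When the server declares a charset, it appears as the charset parameter of the Content-Type header, available through conn.getContentType().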
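For the header-based route, here is a minimal sketch of pulling the charset out of a Content-Type value such as "text/html; charset=UTF-8". ContentTypeCharset and its parse method are hypothetical names introduced for illustration, not JDK APIs, and the fallback charset is an assumption the caller has to choose.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK: extracts the charset parameter
// from a Content-Type header value such as "text/html; charset=UTF-8".
final class ContentTypeCharset {
    // Accepts an optionally quoted value and ignores letter case.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or the fallback when the header is
    // missing, has no charset parameter, or names an unknown charset.
    static Charset parse(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }
}
```

With the connection from the question, this could be used as new InputStreamReader(is, ContentTypeCharset.parse(conn.getContentType(), StandardCharsets.UTF_8)). Many servers omit the charset parameter, in which case only the fallback is used.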



Answer 3:

The short answer is URLConnection.getContentEncoding(). The right answer is what cletus suggests: use an appropriate third-party library unless you have a compelling reason not to.



Answer 4:

I had a very similar problem to solve recently. Like the other answers, I started by playing around with HttpClient and similar libraries. However, those libraries require that you know the encoding of the file you want to download up front; otherwise, converting the retrieved HTML file yields unreadable characters.

This approach won't work, because the encoding of the HTML file is specified only in the HTML file itself. Depending on the HTML version, the encoding can be declared in several different ways, such as an XML declaration or either of two different meta elements in the head. If you follow this approach, you would need to:

  1. Download the file and parse the HTML content to figure out the encoding.
  2. Download the file a second time, this time specifying the proper encoding.
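The two steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions, not how HttpClient or JSoup work internally: the page bytes are already buffered in memory (so the second download becomes a second decode), the real encoding is ASCII-compatible, and the declaration sits within the first 4096 bytes. MetaCharsetSniffer and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a simplified sniffer, not a full HTML parser.
final class MetaCharsetSniffer {
    // Covers <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);
    // Covers <?xml version="1.0" encoding="..."?>.
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Step 1: look at the raw content to find the declared encoding.
    static Charset sniff(byte[] page, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode
        // cannot fail and keeps ASCII markup readable for the regexes.
        int n = Math.min(page.length, 4096);
        String head = new String(page, 0, n, StandardCharsets.ISO_8859_1);
        for (Pattern p : new Pattern[] {META, XML_DECL}) {
            Matcher m = p.matcher(head);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported name: try the next pattern.
                }
            }
        }
        return fallback;
    }

    // Step 2: decode the same bytes again with the detected charset.
    static String decode(byte[] page, Charset fallback) {
        return new String(page, sniff(page, fallback));
    }
}
```

Even this small version has to ignore byte order marks, commented-out markup, and encodings such as UTF-16 that are not ASCII-compatible.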

Parsing HTML content to find the encoding declaration is especially error-prone. Instead, I suggest you rely on a library like JSoup, which will do the job for you. So rather than downloading the file via HttpClient, let JSoup retrieve it. In addition, JSoup provides a nice API for accessing different parts of the HTML page directly (e.g. the page title).