Reading the content of a web page

Posted 2019-02-25 13:46

Hi, I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters appear as strange characters. Any help, please? Here is my code:

String link = "some german link";

URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}

4 answers
爷的心禁止访问
#2 · 2019-02-25 14:29

First, verify that the font you are using can support the particular German characters you are trying to display. Many fonts don't carry all characters, and it is a big pain looking for other reasons when it's a simple "missing character" issue.

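If that's not the issue, then either your input or your output is in the wrong character set. A character set (encoding) determines how the bytes of a stream are mapped to characters, which are then drawn as glyphs. Java strings are UTF-16 internally, but new InputStreamReader(stream) without a charset argument decodes with the JVM's default charset, which often differs from the page's encoding. Check the input stream first; System.out also uses a platform-dependent encoding, so the console can be a second source of garbling.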
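To see what a charset mismatch looks like, here is a minimal sketch. The class and method names are made up for illustration, and the German word is written with Unicode escapes so the result does not depend on the source file's encoding. It encodes a string as UTF-8 and then decodes the same bytes as ISO-8859-1, which is the kind of mix-up that produces the "strange characters":

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class MojibakeDemo {

    // Encodes text as UTF-8, then decodes those bytes with another charset.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "f\u00FCr"; // "für"

        String right = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);
        String wrong = decodeUtf8BytesAs(original, StandardCharsets.ISO_8859_1);

        // The two bytes of the u-umlaut are read as two separate characters
        // when decoded as ISO-8859-1, so the string grows by one character.
        System.out.println("correct length: " + right.length());
        System.out.println("garbled length: " + wrong.length());
    }
}
```

If the garbled text on your console has this "two odd characters per umlaut" shape, the page is most likely UTF-8 and is being decoded with a single-byte charset.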

乱世女痞
#3 · 2019-02-25 14:32

You have to set the correct encoding. You can find the encoding in the HTTP header:

Content-Type: text/html; charset=ISO-8859-1

The (X)HTML document may also declare an encoding of its own (for example in a meta tag), and that declaration can differ from the header; see HTML Character encodings.
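With plain java.net classes, one way to use that header is sketched below. The class name, method names, and the fallback parameter are assumptions for illustration; only URLConnection.getContentType() and Charset.forName() are standard API. It handles only the HTTP header, not an encoding declared inside the document:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1"; returns fallback if absent or unknown.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Opens a reader that decodes the response with the header's charset.
    static BufferedReader open(URL url, Charset fallback) throws IOException {
        URLConnection conn = url.openConnection();
        Charset charset = charsetOf(conn.getContentType(), fallback);
        return new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

In the question's code, new BufferedReader(new InputStreamReader(url.openStream())) could then be replaced by a call like ContentTypeCharset.open(url, StandardCharsets.UTF_8), where UTF-8 is only a guess for pages that send no charset.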

I can imagine that you will have to deal with many additional issues to parse a web page without errors. There are several HTTP client libraries available for Java that take care of this, e.g. org.apache.httpcomponents. The code will look like this:

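import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient httpclient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://www.spiegel.de");

try
{
  HttpResponse response = httpclient.execute(httpGet);
  HttpEntity entity = response.getEntity();
  if (entity != null)
  {
    // EntityUtils.toString decodes using the charset from the Content-Type header
    System.out.println(EntityUtils.toString(entity));
  }
}
catch (ClientProtocolException e) {e.printStackTrace();}
catch (IOException e) {e.printStackTrace();}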

This is the Maven artifact:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.1.1</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>
一夜七次
#4 · 2019-02-25 14:40

You need to specify the character set for your InputStreamReader. It has to match the encoding the page is actually served in, for example:

new InputStreamReader(url.openStream(), "UTF-8")
Fickle 薄情
#5 · 2019-02-25 14:46

Try setting a Charset.

new BufferedReader(new InputStreamReader(url.openStream(), Charset.forName("UTF-8")));
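Put together, the read loop from the question might look like the sketch below. The class and method names are hypothetical, UTF-8 is an assumption about the page's encoding, and an in-memory byte array stands in for url.openStream() so the example runs without a network connection:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class ExplicitCharsetReader {

    // Reads the whole stream line by line, decoding with the given charset.
    // Each line is terminated with '\n' in the returned string.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the stream) when done
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): two German words encoded as UTF-8
        byte[] page = "Stra\u00DFe\nM\u00FCnchen".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);

        // Print through an explicitly UTF-8 stream; the console itself
        // must also be set to UTF-8 for the umlauts to display.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.print(text);
    }
}
```

With a real page you would pass url.openStream() as the first argument. StandardCharsets.UTF_8 avoids the string lookup done by Charset.forName("UTF-8"), but both select the same charset.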