Hi,
I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters appear as strange characters.
Any help, please?
Here is my code:
String link = "some german link";
URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
System.out.println(inputLine);
}
You have to set the correct encoding. You can find the encoding in the HTTP header:
Content-Type: text/html; charset=ISO-8859-1
This may be overridden in the (X)HTML document itself; see HTML Character encodings.
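As a rough sketch of how the header value could be used with plain java.net: the class name HeaderCharsetReader and the helper charsetFrom are made up for this illustration, and the ISO-8859-1 fallback is only an assumption for pages that send no charset. It does not look at a charset declared inside the (X)HTML document.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HeaderCharsetReader {

    // Hypothetical helper: pulls the charset parameter out of a Content-Type
    // value such as "text/html; charset=ISO-8859-1". Returns the fallback if
    // the header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name in the header.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the page to fetch, e.g. the German link from the question.
        URLConnection conn = new URL(args[0]).openConnection();
        // Assumption: fall back to ISO-8859-1 when the server sends no charset.
        Charset charset = charsetFrom(conn.getContentType(), StandardCharsets.ISO_8859_1);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }
}
```

The parsing is kept in its own method so it can be tried on header strings without a network connection.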
Parsing a web page without errors involves many additional issues, but there are several HTTP client libraries available for Java, e.g. org.apache.httpcomponents. The code will look like this:
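import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient httpclient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://www.spiegel.de");
try
{
    HttpResponse response = httpclient.execute(httpGet);
    HttpEntity entity = response.getEntity();
    if (entity != null)
    {
        // Decodes with the charset from the Content-Type header (ISO-8859-1 if none is sent)
        System.out.println(EntityUtils.toString(entity));
    }
}
catch (ClientProtocolException e) {e.printStackTrace();}
catch (IOException e) {e.printStackTrace();}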
This is the Maven artifact:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.1.1</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>
You need to specify the character set for your InputStreamReader, like this:
new InputStreamReader(url.openStream(), "UTF-8")
Try setting a Charset:
new BufferedReader(new InputStreamReader(url.openStream(), Charset.forName("UTF-8")));
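To see what the explicit charset changes, here is a small self-contained sketch that decodes the same bytes twice. The class ExplicitCharsetDemo and the helper readAll are invented for this example, and a ByteArrayInputStream stands in for url.openStream() so it runs without a network connection. The German word is written with \u escapes so the result does not depend on the encoding of the source file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Hypothetical helper: reads the whole stream line by line, decoding the
    // bytes with the given charset instead of the platform default.
    static String readAll(InputStream stream, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(stream, charset))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                text.append(inputLine).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "Grüße" encoded as UTF-8, standing in for the bytes of a web page.
        byte[] page = "Gr\u00fc\u00dfe".getBytes(StandardCharsets.UTF_8);

        // Matching charset: the umlaut and sharp s survive.
        System.out.print(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));

        // Wrong charset: each two-byte UTF-8 sequence turns into two unrelated
        // characters, which is the kind of "strange characters" effect described.
        System.out.print(readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1));
    }
}
```

If the page is not actually UTF-8, pass the charset the server declares instead; hard-coding the wrong one produces the same kind of garbage as relying on the default.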
First, verify that the font you are using can support the particular German characters you are trying to display. Many fonts don't carry all characters, and it is a big pain looking for other reasons when it's a simple "missing character" issue.
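If that's not the issue, then either your input or your output is using the wrong character set. A character set (more precisely, a character encoding) determines how the bytes in a stream map to characters, which the font then renders as glyphs. Java strings are UTF-16 internally, so text is only damaged at the points where bytes are converted to or from characters. An InputStreamReader created without an explicit charset decodes with the platform default, which is the most likely culprit here, so check the input stream first. System.out also encodes with a default encoding, so if the input is decoded correctly and the characters still look wrong, check the output side next.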
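For the output side, one option is to wrap the stream in a PrintStream with an explicit encoding. The sketch below is only an illustration: the class OutputEncodingDemo and the helper encode are hypothetical names, and whether UTF-8 is the right choice depends on what your console or terminal expects.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class OutputEncodingDemo {

    // Hypothetical helper: prints the text through a PrintStream that uses the
    // named encoding and returns the bytes that were actually written.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + charsetName, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The same character becomes different bytes depending on the encoding.
        System.out.println(encode("\u00fc", "UTF-8").length + " byte(s) as UTF-8");
        System.out.println(encode("\u00fc", "ISO-8859-1").length + " byte(s) as ISO-8859-1");

        // Assumption: the console expects UTF-8. Wrap System.out accordingly.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("Gr\u00fc\u00dfe");
    }
}
```

If the console uses a different encoding than the one passed to the PrintStream, the output will still look wrong even though the input was decoded correctly.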