Why does this BufferedReader not read in the speci

Posted 2020-03-26 09:03

I am scraping a few websites, and some of them contain non-Latin characters and special characters such as “ ” for quotes rather than " and ’ for apostrophes rather than '.

Here's the real curve ball...

I have the relevant text printed out to the console. Everything encodes fine when I run it in my IDE (Netbeans). But when I run it on my computer “I Need Your Help” is printed out as: ΓÇ£I Need Your HelpΓÇ¥...

Before anyone says I need to set my JAVA_TOOL_OPTIONS Environment Variable to -Dfile.encoding=UTF8 let me say that I have already done that and this is still a problem. Besides, shouldn't my specifying the encoding for the buffered reader to be "UTF-8" override that anyway?

Here's some info:

  • I'm using JDK 7 with the target platform set to 1.7.
  • Every machine I run this on is a Windows 7 machine, and all of them show the same problem (some don't have JAVA_TOOL_OPTIONS set, but that doesn't seem to make any difference).
  • I think the default encoding being used is Cp1252...

Here's my code. Let me know whether you need more info. Thanks!

/**
 * Using the given url, this method creates and returns the buffered reader for that url
 *
 * @param urlString
 * @return
 * @throws MalformedURLException
 * @throws IOException
 */
public synchronized static BufferedReader getBufferedReader(String urlString) throws MalformedURLException, IOException {
  URL url = new URL(urlString);
  InputStream is = url.openStream();
  BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
  return br;
}

3 Answers
对你真心纯属浪费
#2 · 2020-03-26 09:18

There are two possibilities here. As user1291492 said, it could be that you read the content correctly but the encoding that your terminal uses is different from the one your IDE uses.

The other possibility is that the source data is not in UTF-8. If you're scraping a website, you should pay attention to the encoding the site declares in its Content-Type header rather than assuming that it's always UTF-8.
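As a rough sketch of the second point, the charset can be read from the Content-Type header before the reader is built. The class name CharsetAwareReader and both method names are made up for illustration, and falling back to UTF-8 when no charset is declared is an assumption, not something the site guarantees. Some pages declare their encoding only in an HTML meta tag, which this sketch does not inspect.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the original question's code.
final class CharsetAwareReader {

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // the server sent no Content-Type header
        }
        // A typical value looks like: text/html; charset=ISO-8859-1
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Opens the URL and decodes it with the declared charset (assumed UTF-8 if undeclared). */
    static BufferedReader open(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        Charset cs = charsetFromContentType(conn.getContentType(), StandardCharsets.UTF_8);
        return new BufferedReader(new InputStreamReader(conn.getInputStream(), cs));
    }
}
```

With something like this, CharsetAwareReader.open(urlString) could stand in for getBufferedReader(urlString) from the question.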

Viruses.
#3 · 2020-03-26 09:29
// Fragment: assumes `in` is an open InputStream and `tv` is a text view
// declared elsewhere (neither is shown in this answer).
StringBuilder sb = new StringBuilder();
try {
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    String line;
    // Read line by line until the end of the stream.
    while ((line = reader.readLine()) != null) {
        sb.append(line).append("\n");
    }
} catch (IOException e) {
    // Also covers UnsupportedEncodingException, which is a subclass of IOException.
    e.printStackTrace();
}
tv.setText(sb.toString());
Evening l夕情丶
#4 · 2020-03-26 09:36

The IDE's output window can probably interpret and display UTF-8 characters. The console may not be so capable.
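To illustrate, here is a small sketch that reproduces the garbled output from the question by taking the UTF-8 bytes of the text and decoding them with a different charset, the way a mismatched console would. It assumes the Windows console is using OEM code page 437 (named IBM437 in Java), which is a guess about the asker's machine. The class name ConsoleEncodingDemo and the method misdecode are made up for the example.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question's code.
final class ConsoleEncodingDemo {

    /** Decodes the UTF-8 bytes of text with another charset, as a mismatched console would. */
    static String misdecode(String text, Charset consoleCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), consoleCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Curly quotes written as escapes so the source file's own encoding does not matter.
        String text = "\u201CI Need Your Help\u201D";

        // Assumption: the console decodes bytes as code page 437.
        String garbled = misdecode(text, Charset.forName("IBM437"));

        // Write UTF-8 explicitly so the output does not depend on file.encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(garbled); // the same ΓÇ£...ΓÇ¥ pattern as in the question
        out.println(text);    // readable only if the console itself decodes UTF-8
    }
}
```

If that assumption holds, the reader in the question is decoding the page correctly and the damage happens at display time. Code page 437 has no curly quotes at all, so the console would need to be switched to a code page that contains them (for example with chcp 65001 for UTF-8, which may also require a console font that has those glyphs).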
