I am scraping a few websites, and some of them contain non-Latin characters and special characters like “ for quotes rather than " and ’ for apostrophes rather than '.
Here's the real curve ball...
I print the relevant text to the console. Everything displays correctly when I run it in my IDE (NetBeans). But when I run it on my computer outside the IDE, “I Need Your Help” is printed as: ΓÇ£I Need Your HelpΓÇ¥ ...
Before anyone says I need to set my JAVA_TOOL_OPTIONS environment variable to -Dfile.encoding=UTF8, let me say that I have already done that and this is still a problem. Besides, shouldn't specifying "UTF-8" as the encoding for the buffered reader override that anyway?
Here's some info:
- I'm using JDK 7 with the target platform set to 1.7.
- Every machine I've run this on is a Windows 7 machine, and they all show the same problem (some don't have JAVA_TOOL_OPTIONS set, but that doesn't seem to make any difference).
- I think the default encoding that it's using is Cp1252...
Here's my code. Let me know whether you need more info. Thanks!
/**
 * Using the given url, this method creates and returns the buffered reader for that url
 *
 * @param urlString
 * @return
 * @throws MalformedURLException
 * @throws IOException
 */
public synchronized static BufferedReader getBufferedReader(String urlString) throws MalformedURLException, IOException {
    URL url = new URL(urlString);
    InputStream is = url.openStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    return br;
}
There are two possibilities here. As user1291492 said, it could be that you read the content correctly but the encoding that your terminal uses is different from the one your IDE uses.
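To check the terminal-encoding possibility, here is a minimal sketch that reproduces the garbling without any network access. It assumes the Windows console is using code page 437 (named `IBM437` in Java), which is a guess based on the output in the question. The class and method names are made up for illustration, and the sketch needs a JDK that ships the extended charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /**
     * Encodes the text as UTF-8 and then decodes those bytes with another
     * charset, which is what a console with a mismatched code page does.
     */
    static String misdecode(String text, String consoleCharsetName) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName(consoleCharsetName));
    }

    public static void main(String[] args) {
        // \u201C and \u201D are the curly quotes from the question,
        // written as escapes so the source file encoding does not matter.
        String original = "\u201CI Need Your Help\u201D";

        // Assumption: the console decodes bytes as code page 437.
        String garbled = misdecode(original, "IBM437");

        // Print the code points so the result is readable on any console.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X ", (int) garbled.charAt(i));
        }
        System.out.println();
    }
}
```

The UTF-8 bytes for “ are E2 80 9C, and code page 437 maps those three bytes to Γ, Ç and £, which matches the ΓÇ£ in the question. That suggests the reader is decoding the page correctly and the mismatch is on the output side.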
The other possibility is that the source data is not in UTF-8. If you're scraping a website, then you should pay attention to the encoding the website reports in its Content-Type header rather than assuming that it's always UTF-8.
The IDE's output window can probably understand and print UTF-8 characters. The console may not be so advanced.
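For the Content-Type point, here is a rough sketch of picking the charset from the response header instead of hard-coding "UTF-8". The class name, the `charsetOf` helper and the UTF-8 fallback are my own choices for illustration, and the parsing is deliberately simple (it does not look at `<meta>` tags inside the HTML).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=ISO-8859-1". Returns the fallback when the header
     * is missing, has no charset parameter, or names an unknown charset.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length())
                        .replace("\"", "")
                        .trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Opens the URL and decodes it with the charset the server reports. */
    static BufferedReader getBufferedReader(String urlString) throws IOException {
        URLConnection connection = new URL(urlString).openConnection();
        // Assumption: fall back to UTF-8 when the server does not say.
        Charset charset = charsetOf(connection.getContentType(), StandardCharsets.UTF_8);
        return new BufferedReader(new InputStreamReader(connection.getInputStream(), charset));
    }
}
```

If the reader is already decoding correctly, this changes nothing about the console output, so it only helps with the second possibility.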