I am scraping a few websites, and some of them contain non-Latin characters and special characters such as “ for quotes rather than " and ’ for apostrophes rather than '.
Here's the real curveball: I print the relevant text to the console. Everything displays correctly when I run it in my IDE (NetBeans), but when I run it outside the IDE on my computer, “I Need Your Help” is printed as: ΓÇ£I Need Your HelpΓÇ¥
...
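As a rough sketch of what might be going on (not a confirmed diagnosis), the garbled text is what you get when the UTF-8 bytes of the curly quotes are decoded with the OEM code page Cp437. The class name `MojibakeDemo`, the helper `misdecode`, and the assumption that the console uses Cp437 (Java name `IBM437`) are all hypothetical and not taken from my actual setup. The quotes are written as Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Encodes text as UTF-8, then decodes those same bytes with a different charset. */
    public static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // \u201C and \u201D are the left and right curly quotes from the scraped text.
        String original = "\u201CI Need Your Help\u201D";

        // Assumption: the console interprets bytes as Cp437 ("IBM437" in Java).
        Charset assumedConsoleCharset = Charset.forName("IBM437");

        // \u0393\u00C7\u00A3 and \u0393\u00C7\u00A5 are the garbled sequences from the question.
        String garbled = "\u0393\u00C7\u00A3I Need Your Help\u0393\u00C7\u00A5";

        System.out.println(misdecode(original, assumedConsoleCharset).equals(garbled));
    }
}
```

If that assumption holds, the text is being written as UTF-8 correctly and only the console's interpretation of the bytes is off.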
Before anyone says I need to set my JAVA_TOOL_OPTIONS environment variable to -Dfile.encoding=UTF8, let me say that I have already done that and the problem persists. Besides, shouldn't specifying "UTF-8" as the encoding for the buffered reader override that anyway?
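To make that last question concrete, here is a small sketch under my own assumptions: the charset given to the reader only controls how incoming bytes are decoded into a String, while the bytes that come out of a PrintStream depend on that stream's own charset. The class `ReaderVersusPrinter` and its two helpers are hypothetical names used only for illustration, and an in-memory byte array stands in for the web page.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;

public class ReaderVersusPrinter {

    /** Decodes the first line of UTF-8 input, the same way the reader in my code does. */
    public static String readFirstLine(byte[] utf8Input) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(utf8Input), "UTF-8"))) {
            return br.readLine();
        }
    }

    /** Prints text through a PrintStream with the named charset and returns the raw bytes. */
    public static byte[] printedBytes(String text, String outputCharset) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, outputCharset);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the scraped page: curly quotes written as Unicode escapes.
        byte[] page = "\u201CI Need Your Help\u201D".getBytes("UTF-8");
        String line = readFirstLine(page);

        // Same String, different output charsets, different bytes.
        System.out.println(printedBytes(line, "UTF-8").length);        // 22: each quote takes 3 bytes
        System.out.println(printedBytes(line, "windows-1252").length); // 18: each quote takes 1 byte
    }
}
```

In this sketch the String is decoded correctly either way, so the reader's "UTF-8" argument does not determine what reaches the console.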
Here's some info:
- I'm using JDK 7 with the target platform set to 1.7.
- All the machines I'm running this on are Windows 7, and they all show the same problem (some don't have JAVA_TOOL_OPTIONS set, but that doesn't seem to make any difference).
- I think the default encoding it's using is Cp1252...
Here's my code. Let me know whether you need more info. Thanks!
/**
 * Using the given url, this method creates and returns the buffered reader for that url
 *
 * @param urlString
 * @return
 * @throws MalformedURLException
 * @throws IOException
 */
public static synchronized BufferedReader getBufferedReader(String urlString) throws MalformedURLException, IOException {
    URL url = new URL(urlString);
    InputStream is = url.openStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    return br;
}