I am trying to get the source from a URI. It's reported as UTF-8. I have also tried ISO-8859-1, Windows-1250 and ISO-8859-2.
Here is my code of the latest attempt (trying ISO-8859-2):
    public static String getPage(String page, String charset) throws IOException {
        URL url = new URL(page);
        return org.apache.commons.io.IOUtils.toString(url.openConnection().getInputStream(), charset);
    }

    public static void main(String[] args) throws Exception {
        String page = getPage("http://buscon.rae.es/drae/srv/search?val=aba", "ISO-8859-2");
        System.out.println(page);
    }
But the result is:
apÄ?ge 'quita, aparta', y este del gr. á¼?Ï?αγε)
instead of:
(Del lat. apăge 'quita, aparta', y este del gr. ἄπαγε).
Likewise, UTF-8 (which works with other code and in browsers) and the other encoding names also fail in a similar manner.
U+0103 (ă) is encoded as the byte sequence C4 83; this data is UTF-8.
The bug is likely due to the other transcoding operation you are performing via the PrintStream attached to System.out. This will encode the data to the system encoding, which may be a lossy conversion and may cause corruption if the device being written to doesn't use a matching encoding.
You can read some analysis of this with respect to the Windows console here.
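To make both transcoding steps visible, here is a minimal sketch. The class name ConsoleEncodingSketch and the helpers hex and printAs are hypothetical, introduced only for illustration. It checks that ă really is C4 83 in UTF-8, shows what a PrintStream writes when its encoding cannot represent the character, and then wraps System.out in a PrintStream with an explicit UTF-8 encoding. It assumes the page is decoded with "UTF-8" when read, and that the terminal itself interprets bytes as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original code.
class ConsoleEncodingSketch {

    // Formats bytes as space-separated upper-case hex, for example "C4 83".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Writes text through a PrintStream that uses the named encoding
    // and returns the bytes that actually reached the underlying stream.
    static byte[] printAs(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // The Unicode escape avoids any dependence on the source file's encoding.
        String word = "ap\u0103ge";

        // U+0103 encoded as UTF-8.
        System.out.println(hex("\u0103".getBytes(StandardCharsets.UTF_8))); // C4 83

        // A PrintStream whose encoding lacks U+0103 silently substitutes '?'.
        System.out.println(hex(printAs("\u0103", "ISO-8859-1"))); // 3F

        // Wrap System.out so the encoding is chosen explicitly
        // instead of falling back to the system default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(word);
    }
}
```

The last line only displays correctly if the device reading the bytes also expects UTF-8. Writing the output to a file and opening it in a UTF-8 aware editor is one way to rule the console out as the source of the corruption.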