I'm using HtmlCleaner
to scrape a ISO-8859-1
encoded web site in Android.
I've implemented this in an external jar
file that I import into my Android app.
When I run the unit tests in Eclipse it handles Norwegian letters (æ,ø,å
) correct (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.
If I attach the debugger to my Android app I can see that these letters are not correct in the exact same places they were good when running unit test from Eclipse, so it's not a display/render/view issue in the Android app.
When I copy the text from the debuggers I get these results:
Java Process (Unit Test): «Blårek», «Benny»
Android Process (In emulator): «Bl�rek», «Benny»
I would expect these Strings to be equal, but notice how the "å" is replaed by the inverted question marks in Android.
I have tried running htmlCleaner.getProperties().setRecognizeUnicodeChars(true)
without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in html cleaner, but I' not sure if that would have made a difference.
Here is the code i run:
HtmlCleaner htmlCleaner = new HtmlCleaner();
// connect to url and get root TagNode from HtmlCleaner
InputSteram is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );
// navigate through some TagNodes, getting the ContentNode
ContentNode cn = rootNode...
// This String contains the incorrectly decoded characters on Android.
// Good in Oracle JVM though..
String value = cn.toString().trim();
Does anyone knows what could cause the decoding behavoir to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack while my unit tests use Sun/Oracle's stack.
Thanks,
Geir