Non-English characters are decoded incorrectly on

Posted 2019-05-21 01:49

Question:

I'm using HtmlCleaner to scrape an ISO-8859-1 encoded web site in Android.

I've implemented this in an external jar file that I import into my Android app.

When I run the unit tests in Eclipse, it handles Norwegian letters (æ, ø, å) correctly (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.

If I attach the debugger to my Android app, I can see that these letters are wrong in exactly the places where they were correct when running the unit tests from Eclipse, so it's not a display/render/view issue in the Android app.

When I copy the text from the debuggers I get these results:

Java Process (Unit Test): «Blårek», «Benny»

Android Process (In emulator): «Bl�rek», «Benny»

I would expect these Strings to be equal, but notice how the "å" is replaced by an inverted question mark on Android.

I have tried running htmlCleaner.getProperties().setRecognizeUnicodeChars(true) without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in HtmlCleaner, but I'm not sure if that would have made a difference.

Here is the code I run:

HtmlCleaner htmlCleaner = new HtmlCleaner();

// connect to url and get root TagNode from HtmlCleaner
InputStream is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );

// navigate through some TagNodes, getting the ContentNode 
ContentNode cn = rootNode... 

// This String contains the incorrectly decoded characters on Android. 
// Good in Oracle JVM though..
String value = cn.toString().trim();

Does anyone know what could cause the decoding behavior to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack while my unit tests use Sun/Oracle's stack.

Thanks,
Geir

Answer 1:

HtmlCleaner can't tell what encoding to use; you are passing only the body of the response in the InputStream, but the encoding is in the "content-type" header.
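To illustrate why the same bytes can come out differently in the two environments, here is a minimal sketch. It assumes the parser falls back to the platform default charset when it is not told which one to use; on Android that default is UTF-8, while a desktop JVM may default to a Latin-1 compatible charset. The class name DecodeDemo and its decode helper are made up for this illustration. In ISO-8859-1 the letter å is the single byte 0xE5, which is not a valid UTF-8 sequence on its own, so a UTF-8 decoder substitutes the replacement character U+FFFD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: decode raw response bytes with an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Blårek" as an ISO-8859-1 server would send it: 'å' is the single byte 0xE5.
        byte[] body = "Bl\u00e5rek".getBytes(StandardCharsets.ISO_8859_1);

        // Decoded with the charset the server actually used.
        String right = decode(body, StandardCharsets.ISO_8859_1);
        // Decoded as UTF-8, which is what Android uses by default.
        String wrong = decode(body, StandardCharsets.UTF_8);

        // True when the text round-trips intact.
        System.out.println(right.equals("Bl\u00e5rek"));
        // True when the decoder had to substitute U+FFFD for the lone 0xE5 byte.
        System.out.println(wrong.indexOf('\uFFFD') >= 0);
    }
}
```

If that assumption holds, it would explain why only the non-ASCII letters are affected: the ASCII range is encoded identically in both charsets.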

You can set the character encoding on the properties of the HtmlCleaner to the correct encoding from the HTTP connection. But that would require you to parse the correct parameter from the content-type header. Alternatively, you can pass a URL instance to HtmlCleaner and let it manage the connection. Then, it will have access to all the information it needs to decode properly.
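A rough sketch of the first option, parsing the charset parameter out of the Content-Type value, might look like this. The class ContentTypeCharset and the method charsetOf are hypothetical names, and the parsing is deliberately simple (it does not cover every form the header can legally take). The fallback charset is an assumption you would pick for your site, ISO-8859-1 in this case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {
    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or the given fallback when the parameter is missing or not usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by ';'.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

In the app, the header value would come from URLConnection.getContentType() on the same connection you read the stream from. The resulting charset name can then be handed to HtmlCleaner, for example through an overload of clean that takes the stream together with a charset name, assuming your HtmlCleaner version provides one.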