Gradle/Eclipse: Different behavior of german “Umla

Posted 2019-08-12 01:34

Question:

I am experiencing weird behavior with German "Umlaute" (ä, ö, ü, ß) when using Java's equality checks (either directly or indirectly). Everything works as expected when running, debugging or testing from Eclipse: input containing "Umlaute" is treated as equal or not equal, as it should be.

However, when I build the application using Spring Boot and run it, these equality checks fail for words that contain "Umlaute", e.g. "Nationalität".

Input is retrieved from a webpage via Jsoup and content of a table is extracted for some keywords. The encoding of the page is UTF-8 and I have handling in place for Jsoup to convert it if this is not the case. The encoding of the source files is UTF-8 as well.

    Connection connection = Jsoup.connect(url)
                .header("accept-language", "de-de, de, en")
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .method(Method.GET);
    Response response = connection.execute();
    if(logger.isDebugEnabled())
        logger.debug("Encoding of response: " +response.charset());
    Document doc;
    if(response.charset().equalsIgnoreCase("UTF-8"))
    {
        logger.debug("Response has expected charset");
        doc = Jsoup.parse(response.body(), baseURL);
    }
    else
    {
        logger.debug("Response doesn't have expected charset and is converted");
        doc = Jsoup.parse(new String(response.bodyAsBytes(), "UTF-8"), baseURL);
    }

    logger.debug("Encoding of document: " +doc.charset());
    if(!doc.charset().equals(Charset.forName("UTF-8")))
    {
        logger.debug("Changing encoding of document from " +doc.charset());
        doc.updateMetaCharsetElement(true);
        doc.charset(Charset.forName("UTF-8"));
        logger.debug("Changed encoding of document to: " +doc.charset());
    }
    return doc;

Example log output (from deployed app) of reading content.

Encoding of response: utf-8
Response has expected charset
Encoding of document: UTF-8

Example input:

<tr><th>Nationalität:</th>     <td> [...] </td>    </tr>

Example code that fails for words containing ä, ö, ü or ß but works fine for other words:

Element header = row.select("th").first();
String text = header.ownText();
if("Nationalität:".equals(text))
{
 // goes here in eclipse
}
else
{
 // and here in deployed spring boot app
}

Is there any difference between running from Eclipse and running a built and deployed app that I am missing? Where else could this behavior come from, and how can it be resolved?

As far as I can see this is not (directly) an encoding issue since the input shows "Umlaute" correctly... Since this is not reproducible when debugging, I am having a hard time figuring out what exactly goes wrong.

Edit: While input looks fine in logs (i.e. diacritics show up correctly) I realized that they don't look correct in the console: <th>Nationalität:</th>

I am currently using a Normalizer as suggested by Mirko, like this: Normalizer.normalize(input, Form.NFC); (I also tried it with NFD). How do the (Spring Boot) console output and the (logback) log output differ?

Answer 1:

Diacritics like umlauts can often be represented in two different ways in Unicode: as a single precomposed code point, or as a base letter followed by a combining character. This isn't a problem of the encoding; it can happen in UTF-8, UTF-16, UTF-32 etc. Java's String.equals compares the underlying UTF-16 code units one by one, so it does not consider a composed sequence equal to the precomposed character, even though the two look exactly the same. Try to have a look at the binary representation of the strings you are comparing; this way you should be able to track down the differences. You could also use the methods of the "Character" class to iterate through the strings and print out the properties of all the characters, which may help to reveal differences as well.

In any case, it could help if you use java.text.Normalizer on both "sides" of the "equals", to normalize the text to, for example, Unicode Normalization Form C. This way, differences like the aforementioned should be straightened out and the strings should compare as expected.
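
A minimal sketch of that suggestion follows. The class name and the equalsNormalized helper are hypothetical and only serve the illustration; the strings are written with Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.text.Normalizer;

public class UmlautNormalization {

    // Hypothetical helper: normalizes both sides to NFC before comparing.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        // Precomposed form: the umlaut is the single code point U+00E4.
        String precomposed = "Nationalit\u00E4t:";
        // Decomposed form: plain 'a' followed by the combining diaeresis U+0308.
        String decomposed = "Nationalita\u0308t:";

        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(equalsNormalized(precomposed, decomposed)); // true
    }
}
```

Note that normalization only helps when both strings contain the same characters in different Unicode forms. It cannot repair text that was decoded with the wrong charset.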



Answer 2:

Have you tried printing the code points of both strings to the console to see if they actually match in the compiled app? Maybe Eclipse is handling the charset gracefully, but in the compiled build it comes down to some Java or system settings?
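
One way to run that check is sketched below. The class name and the codePoints helper are hypothetical; the sample string uses a Unicode escape so the sketch itself compiles the same way under any source encoding.

```java
import java.nio.charset.Charset;
import java.util.stream.Collectors;

public class CodePointDump {

    // Hypothetical helper: renders every code point of s as U+XXXX.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The JVM settings that can differ between Eclipse and a deployed app.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Sample with a precomposed umlaut (U+00E4).
        System.out.println(codePoints("Nationalit\u00E4t:"));
    }
}
```

In the application, logging codePoints(...) for both the string literal and the value returned by header.ownText() should show where the two differ. A literal that comes out as U+00C3 U+00A4 instead of U+00E4 would point to the source file having been compiled with the wrong encoding rather than to a normalization difference.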



Answer 3:

I think I tracked this down to the build of the standalone app being the culprit. As described above, when running from Eclipse all is fine, the problem only occurred when I ran the standalone Spring Boot app.

This is being built with Gradle. In my build.gradle I have

compileJava.options.encoding = 'UTF-8'

in order to force UTF-8 to be used as the source encoding. This should (usually) be enough. However, I also use AspectJ (via the gradle-aspectj plugin), which apparently breaks this behavior (involuntarily?) and results in a default encoding being used instead of the one explicitly defined. In order to solve this I added

compileAspect {
  additionalAjcArgs = ['encoding' : 'UTF-8']
}

to my build.gradle which passes the encoding option on to the ajc compiler. This seems to have fixed the problem for the regular build.
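
To illustrate why the compiler's source encoding matters for the comparison, here is a minimal sketch. It assumes the fallback encoding was a single-byte charset such as ISO-8859-1; the thread does not say which default was actually used, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongSourceEncoding {

    // Hypothetical helper: decodes the UTF-8 bytes of text with another charset,
    // roughly what happens to a string literal when a UTF-8 source file is
    // compiled with the wrong encoding.
    static String misread(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        String intended = "Nationalit\u00E4t:";
        // Assumption: ISO-8859-1 stands in for the unknown default encoding.
        String compiled = misread(intended, StandardCharsets.ISO_8859_1);

        // The two UTF-8 bytes of the umlaut become two separate characters.
        System.out.println(compiled.length() - intended.length()); // 1
        System.out.println(intended.equals(compiled));             // false
    }
}
```

Under that assumption the literal in the class file no longer contains an umlaut at all, so it cannot equal correctly decoded input from the web page, and Unicode normalization cannot make the two match.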

However, the problem still occurs when tests are run from Gradle. I have not yet been able to find out what needs to be done there and why the above configuration is not enough. This is now tracked in a separate question.