UTF-8 decoding in Java

I'm trying to pass parameters from a PHP middle tier to a Java backend that understands J2EE. I'm writing the controller code in Groovy. In there, I'm trying to decode a parameter that will likely contain international characters.

I am really puzzled by the results of my debugging this problem so far, hence I wanted to share it with you in the hope that someone will be able to give the correct interpretation of my results.

For the sake of my little test, the parameter I'm passing is "déjeuner". Just to be sure, System.out.println("déjeuner") correctly gives me:

déjeuner

in the console

Now following are the char/dec and hex values of each char of the original string:

next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

note that the c3a9 sequence in UTF-8 is the wished-for character: http://www.fileformat.info/info/unicode/char/00e9/index.htm

Now if I try to read this string as a UTF-8 string, as in stmt.getBytes("UTF-8"), I suddenly end up with an 11-byte sequence, as follows:

64 c3 83 c2 a9 6a 65 75 6e 65 72

whereas stmt.getBytes("iso-8859-1") gives me 9 bytes:

64 c3 a9 6a 65 75 6e 65 72

note the c3a9 sequence here!

now if I try to convert the UTF-8 sequence to UTF-8, as in

new String(stmt.getBytes("UTF-8"), "UTF-8");

I get:

next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

note the c3a9 sequence

while

new String(stmt.getBytes("iso-8859-1"), "UTF-8")

results in:

next char: d 100 64
next char: ? -23 e9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

note the e9 which in utf-8 (and ascii) is, again, the 'é' character that I'm longing for.

Unfortunately, in neither case am I ending up with a proper string that would display like the literal string "déjeuner". Strangely enough, the byte sequences both seem correct though.

Tags: java encoding utf-8 groovy

4 answers

时光不老，我们不散

Post #2 · 2020-06-17 04:43

If you start with the Java String where "d\u00C3\u00A9jeuner".equals(stmt) then the data is already corrupt at this stage.

A Java char is not a C char. A char in Java is 16 bits wide and implicitly contains UTF-16 encoded data. Trying to store any other encoded data in a Java char/String type is asking for trouble. Character data in any other encoding should be stored as byte data.

If you are reading the parameter using the servlet API, then it is likely that the HTTP request contains inconsistent or insufficient encoding information. Check the calling code and the HTTP headers. It is likely that the client is encoding the data as UTF-8, but the servlet is decoding it as ISO-8859-1.
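To make that concrete, here is a minimal sketch that reproduces the numbers from the question. It assumes the client sent "déjeuner" as UTF-8 and that something on the server side decoded those bytes as ISO-8859-1. The class and method names (MojibakeDemo, misdecode, hex) are made up for illustration and are not part of any API.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Simulates a server that decodes UTF-8 request bytes as ISO-8859-1 (assumed scenario).
    public static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Formats bytes as space-separated lowercase hex, e.g. "64 c3 a9".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String stmt = misdecode("d\u00E9jeuner");

        // The corrupted String holds the chars U+00C3 and U+00A9 instead of U+00E9.
        System.out.println(stmt.equals("d\u00C3\u00A9jeuner")); // true

        // Encoding the corrupted String as UTF-8 gives the 11 bytes from the question.
        System.out.println(hex(stmt.getBytes(StandardCharsets.UTF_8)));
        // 64 c3 83 c2 a9 6a 65 75 6e 65 72

        // Encoding it as ISO-8859-1 gives back the 9 bytes that were on the wire.
        System.out.println(hex(stmt.getBytes(StandardCharsets.ISO_8859_1)));
        // 64 c3 a9 6a 65 75 6e 65 72
    }
}
```

If this matches what you see, the String was already wrong before your controller code touched it.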


在下西门庆

Post #3 · 2020-06-17 04:46

When dealing with Strings, always remember: byte != char. So in your first example, you have the char c3, not the byte c3 which is a huge difference: The byte would be part of the UTF-8 sequence but the char already is Unicode. So when you convert that to UTF-8, the Unicode character c3 must become the byte sequence c3 83.
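A small sketch of the byte != char point, using only the standard library. The class and helper names (CharVsByte, utf8Of) are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class CharVsByte {
    // Returns the UTF-8 encoding of a single char as unsigned byte values.
    public static int[] utf8Of(char c) {
        byte[] encoded = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
        int[] unsigned = new int[encoded.length];
        for (int i = 0; i < encoded.length; i++) {
            unsigned[i] = encoded[i] & 0xff;
        }
        return unsigned;
    }

    public static void main(String[] args) {
        // The char U+00C3 is a full Unicode character, so UTF-8 needs two bytes for it.
        for (int b : utf8Of('\u00C3')) {
            System.out.printf("%02x ", b); // c3 83
        }
        System.out.println();

        // A plain ASCII char such as 'd' still encodes to a single byte.
        for (int b : utf8Of('d')) {
            System.out.printf("%02x ", b); // 64
        }
        System.out.println();
    }
}
```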

So the question is: How did you get the String? There must be a bug in that code which doesn't properly handle UTF-8 encoded byte sequences.

The reason why ISO-8859-1 usually works is that this encoding doesn't modify any char with a code point < 256 (i.e. anything between 0 and 255), so UTF-8 encoded byte sequences won't be modified.
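As a sketch of why that round trip works: encoding with ISO-8859-1 maps each char below 256 to the byte with the same value, so the original UTF-8 bytes come back unchanged and can be decoded properly. The names Latin1RoundTrip and repair are hypothetical, and this is a workaround under the assumption that the String was misdecoded as ISO-8859-1 exactly once. It is not a substitute for fixing the decoding at the source.

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Recovers text whose UTF-8 bytes were wrongly decoded as ISO-8859-1.
    // Only lossless when every char in the input is below U+0100.
    public static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String stmt = "d\u00C3\u00A9jeuner"; // the corrupted String from the question
        String fixed = repair(stmt);

        System.out.println(fixed.length());                  // 8
        System.out.println(fixed.equals("d\u00E9jeuner"));   // true
        System.out.println(Integer.toHexString(fixed.charAt(1))); // e9
    }
}
```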

Your last example is also wrong: The char e9 is é in ISO-8859-1 and Unicode. As UTF-8 it isn't valid: it is a char, not a byte, and é in UTF-8 is the two-byte sequence c3 a9. That said, it correctly represents the Unicode string you seek.


家丑人穷心不美

Post #4 · 2020-06-17 04:53

After some further investigation I found this answer

How to get UTF-8 working in Java webapps?.

It's all about setting URIEncoding="UTF-8" in the Tomcat connector.

Now to figure out how to do this in the CMS we use (CQ5/Day).


疯言疯语

Post #5 · 2020-06-17 05:01

I'm having a very similar problem, except that my form uses a GET request, not a POST request.

So, my URL is something like: http://localhost:4502/form.jsp?query=d%C3%A9jeuner

request.getCharacterEncoding() = ISO-8859-1
response.getCharacterEncoding() = UTF-8
request.getParameter("query") = dÃ©jeuner

So should the HttpServletRequest use UTF-8 to decode the request param (which it clearly isn't doing), or is this simply a browser error because the browser does not set any character encoding header (which again doesn't make much sense, because it's not doing a POST request)? Here is the full set of headers; notice the %C3%A9 in the URL.

http://localhost:4502/form.jsp?query=d%C3%A9juerne

GET /form.jsp?query=d%C3%A9juerne HTTP/1.1
Host: localhost:4502
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.0.17) Gecko/2010010604 Ubuntu/9.04 (jaunty) Firefox/3.0.17
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

The problem I'm having is that I actually copied and pasted the query into the browser form and it was encoded incorrectly, in both Chrome and Firefox.
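For what it's worth, the percent-encoded value from the URL above can be decoded outside the container to see which charset gives which result. This is only a sketch: QueryDecodeDemo and decode are names I made up, and in a servlet you would presumably feed it a value taken from the raw query string (for example via request.getQueryString(), which is not shown here).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryDecodeDemo {
    // Percent-decodes a query value using the named charset.
    public static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Rethrown unchecked to keep the example short.
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "d%C3%A9jeuner";

        // Decoding the two bytes c3 a9 as UTF-8 yields the single char U+00E9.
        System.out.println(decode(raw, "UTF-8").equals("d\u00E9jeuner")); // true

        // Decoding the same bytes as ISO-8859-1 yields U+00C3 U+00A9,
        // which is the value getParameter returned above.
        System.out.println(decode(raw, "ISO-8859-1").equals("d\u00C3\u00A9jeuner")); // true
    }
}
```

So the bytes in the URL are fine as UTF-8; the result only depends on which charset the decoding side picks.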

