Will String.getBytes(“UTF-16”) return the same res

Posted 2020-04-07 03:42

I need to create a hash from a String containing a user's password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call this method with a specified encoding (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly), and therefore on such a platform I will get a different byte array and, eventually, a different hash.

Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?

3 answers
爱情/是我丢掉的垃圾
Post #2 · 2020-04-07 04:14

It is true that Java uses Unicode internally, so a String may combine any scripts and languages. String and char use UTF-16, whereas .class files store their String constants in a modified form of UTF-8. In general, what String does internally is irrelevant, because getBytes performs a conversion into whatever encoding you specify for the bytes.

If the target encoding cannot represent some of the Unicode characters, a placeholder character, typically a question mark, is substituted. Fonts may not cover all of Unicode either; 35 MB is a normal size for a full Unicode font. You might then see a square containing a 2x2 grid of hex digits for missing code points, or on Linux another font might substitute the character.
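To make the placeholder behaviour concrete, here is a minimal sketch. The class and method names are made up for illustration; it encodes the same string with a charset that cannot represent every character (US-ASCII) and with UTF-8, which can.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Encode to US-ASCII and decode again; unmappable characters are lost.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Encode to UTF-8 and decode again; every Unicode character survives.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"
        System.out.println(roundTripAscii(text));             // caf?
        System.out.println(roundTripUtf8(text).equals(text)); // true
    }
}
```

The substitution depends only on the charset passed to getBytes, not on the platform's default encoding.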

Hence UTF-8 is a perfectly fine choice.

String s = ...;
if (!s.startsWith("\uFEFF")) { // Add a Unicode BOM
    s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

Both UTF-16 (in both byte orders) and UTF-8 are always present in the JRE, whereas some other Charsets are not. Hence you can use a constant from StandardCharsets and need not handle an UnsupportedEncodingException.
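As a small illustration of that difference (the class and method names here are hypothetical), looking a charset up by name forces you to deal with a checked exception, while the StandardCharsets constant does not, and both produce the same bytes:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetLookupDemo {
    // Lookup by name: the compiler insists on handling UnsupportedEncodingException.
    static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen in practice, since UTF-8 is a required charset.
            throw new IllegalStateException("UTF-8 must be supported", e);
        }
    }

    // Lookup by constant: no checked exception involved.
    static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "gr\u00FC\u00DF"; // "grüß"
        System.out.println(Arrays.equals(encodeByName(text), encodeByConstant(text))); // true
    }
}
```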

The BOM I prepended to the string is there mainly so that Windows Notepad recognizes the text as UTF-8. It certainly is not good practice, but it can be a small help here.

There is no disadvantage to UTF-16LE or UTF-16BE. I think UTF-8 is a bit more universally used, and UTF-16 cannot store every Unicode code point in 16 bits either (code points above U+FFFF take two 16-bit units). Text in Asian scripts would be more compact in UTF-16, but HTML pages are already more compact in UTF-8 because of the HTML tags and other Latin-script text.
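The size trade-off can be checked directly. The following sketch (class and method names are illustrative only) compares encoded lengths for plain Japanese text, a code point outside the Basic Multilingual Plane, and the same Japanese text wrapped in an HTML tag:

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizeDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16BE is used so that no byte-order mark is counted.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E";       // "日本語"
        String emoji = "\uD83D\uDE00";                // U+1F600, a surrogate pair in UTF-16
        String html = "<p>" + japanese + "</p>";

        System.out.println(utf8Length(japanese) + " vs " + utf16Length(japanese)); // 9 vs 6
        System.out.println(utf8Length(emoji) + " vs " + utf16Length(emoji));       // 4 vs 4
        System.out.println(utf8Length(html) + " vs " + utf16Length(html));         // 16 vs 20
    }
}
```

Even a single pair of tags is enough to tip this short example in favour of UTF-8.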

For Windows UTF-16LE might be more native.

On non-Unicode platforms, especially Windows, problems with placeholder characters might still occur.

We Are One
Post #3 · 2020-04-07 04:18

Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:

When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.

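(Note that this means the array returned by String.getBytes("UTF-16") starts with the BOM bytes FE FF. If you don't want a BOM, ask for "UTF-16BE" or "UTF-16LE" explicitly; those charsets don't write one when encoding.)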

So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)
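To see the exact bytes the encoder produces, a quick check along these lines can help. This is a sketch with made-up class and helper names, and the outputs in the comments are what a standard OpenJDK-based JVM prints:

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    // Formats a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "UTF-16" encodes big-endian and writes a big-endian BOM first.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 61
        // The explicit-order variants write no BOM.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16BE))); // 00 61
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16LE))); // 61 00
    }
}
```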

The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.
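Applied to the original problem, a minimal sketch could look like the following. SHA-256 is only a stand-in here, since the question does not say which hash algorithm is used, and the class and method names are hypothetical. The point is that the charset is named explicitly, so the platform default never comes into play.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PasswordDigestDemo {
    // Hashes the UTF-8 bytes of the text; the result does not depend on the default charset.
    static byte[] sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] digest = sha256("p\u00E4ssw\u00F6rd"); // "pässwörd"
        System.out.println(digest.length); // 32
    }
}
```

Whichever charset you pick, use the same one when creating and when verifying the hash.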

对你真心纯属浪费
Post #4 · 2020-04-07 04:22

I just found this:

https://github.com/facebook/conceal/issues/138

which seems to answer your question in the negative.

As per Jon Skeet's answer: the specification is clear. But I guess Android/Mac implementations of Dalvik/JVM don't agree.
