String encoding conversion UTF-8 to SHIFT-JIS

Environment:

JavaSE-6
No frameworks

I have the string ピーター・ジョーズ, which is encoded in UTF-8, and I am having trouble converting it to Shift-JIS without writing the data to a file.

Input (UTF-8 encoding): ピーター・ジョーンズ
Output (SHIFT-JIS encoding): ピーター・ジョーンズ (SHIFT-JIS to be encoded)

I've tried these code snippets to convert UTF-8 strings to SHIFT-JIS:

stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")

Both code snippets return this string output: �s�[�^�[�E�W��[��Y (SHIFT-JIS encoded)

Any ideas on how this can be resolved?

Answer 1:

Internally in Java, Strings are implemented as arrays of UTF-16 code units. But this is an implementation detail; it would be possible to implement a JVM that uses a different encoding internally.

(Note "encoding", "charset" and Charset are more or less synonyms.)

A String should be treated as a sequence of Unicode code points (even though in Java it's a sequence of UTF-16 code units).
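To make the distinction between code units and code points concrete, here is a small sketch. The class and method names are made up for this example. It compares a katakana string (ピーター, every character fits in one UTF-16 code unit) with a single supplementary character (U+20BB7, which needs a surrogate pair).

```java
public class CodePointDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Four katakana characters, written as escapes so the source file
        // encoding does not matter.
        String katakana = "\u30d4\u30fc\u30bf\u30fc";
        // One code point (U+20BB7) stored as a surrogate pair.
        String supplementary = "\ud842\udfb7";

        System.out.println(codeUnits(katakana) + " units, " + codePoints(katakana) + " code points");
        System.out.println(codeUnits(supplementary) + " units, " + codePoints(supplementary) + " code points");
    }
}
```

For the katakana string both counts are 4; for the supplementary character the string has 2 code units but only 1 code point.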

If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.

What you can have is a sequence of bytes that decodes to a String if you decode it using an encoding, such as UTF-8 or Shift-JIS.

Or you can have a String that encodes to a sequence of bytes if you encode it using an encoding, such as UTF-8 or Shift-JIS.

In short, an encoding or Charset is a pair of functions, "encode" and "decode", such that:

// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);

// bytes -> decode -> String
String string = new String(bytes, encoding);
// or using Charset
String string = charset.decode(byteBuffer).toString();

So if you have a byte[] that's encoded using UTF-8:

byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94  e3 83 bc  e3 82 bf   (ピ ー タ)
// e3 83 bc  e3 83 bb  e3 82 b8   (ー ・ ジ)
// e3 83 a7  e3 83 bc  e3 82 ba   (ョ ー ズ)

You can create a String from those bytes using:

String string = new String(utf8Bytes, "UTF-8");
// String now contains "ピーター・ジョーズ"

Then you can encode that String as Shift-JIS using:

byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73  81 5b  83 5e   (ピ ー タ)
// 81 5b  81 45  83 57   (ー ・ ジ)
// 83 87  81 5b  83 59   (ョ ー ズ)

Since those bytes represent a string encoded using Shift-JIS, trying to decode using UTF-8 will produce garbage:

String garbage = new String(shiftJisBytes, "UTF-8");
// String now contains "�s�[�^�[�E�W���[�Y"
// � is the character decoded when given an invalid UTF-8 sequence
// 83 73 81 5b 83 5e   (� s � [ � ^)
// 81 5b 81 45 83 57   (� [ � E � W)
// 83 87 81 5b 83 59   (� � � [ � Y)
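Putting the decode and encode steps together, an in-memory conversion with no file involved could look roughly like the sketch below. The class name and the helpers utf8ToShiftJis and toHex are hypothetical names chosen for this example. It uses the Charset overloads of the String constructor and getBytes (available since Java 6), so there is no checked exception to handle. The input is just the first three characters of the example string (ピータ), written as escapes.

```java
import java.nio.charset.Charset;

public class TranscodeDemo {
    private static final Charset UTF_8 = Charset.forName("UTF-8");
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Decode the UTF-8 bytes to a String, then encode that String as Shift_JIS.
    static byte[] utf8ToShiftJis(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, UTF_8);
        return decoded.getBytes(SHIFT_JIS);
    }

    // Render bytes as space-separated lowercase hex for inspection.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u30d4\u30fc\u30bf";
        byte[] utf8 = text.getBytes(UTF_8);

        System.out.println(toHex(utf8));                 // e3 83 94 e3 83 bc e3 82 bf
        System.out.println(toHex(utf8ToShiftJis(utf8))); // 83 73 81 5b 83 5e
    }
}
```

Comparing hex dumps instead of printing the re-decoded text sidesteps any confusion caused by the console's own encoding.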

Also remember that when you print a String to an output such as System.out, the String is converted to bytes using the platform default encoding, which varies from system to system. It looks like your system default is UTF-8.

System.out.print(string);
// equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));

Then, if your output is, for example, the Windows console, the console converts those bytes back to text using what is very probably a completely different encoding (probably CP437 or CP850) before presenting it to you.

This last part might be tripping you up.
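If you want to control which encoding is used on output rather than relying on the default, one option is to wrap the stream in a PrintStream constructed with an explicit charset name. The sketch below is only an illustration: the class name and the printedBytes helper are made up, and whether the text actually displays correctly still depends on what encoding your terminal expects.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitOutputEncoding {
    // Print text through a PrintStream that uses the named charset and
    // return the bytes that reached the underlying stream.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Four katakana characters, written as escapes.
        String text = "\u30d4\u30fc\u30bf\u30fc";

        // The same wrapping applied to the real standard output.
        PrintStream sjisOut = new PrintStream(System.out, true, "Shift_JIS");
        sjisOut.println(text);

        // Same String, different byte counts depending on the encoding.
        System.out.println(printedBytes(text, "Shift_JIS").length); // 8
        System.out.println(printedBytes(text, "UTF-8").length);     // 12
    }
}
```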

Answer 2:

Using "MS932" instead of "Shift-JIS"/"SJIS" may work.
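As a rough illustration of where the two charsets differ: MS932 (Windows-31J) is Microsoft's extension of Shift_JIS and covers additional vendor characters such as ① (U+2460). As far as I know, the plain katakana in the question encode to the same bytes under both, so MS932 mainly matters if your data contains such extension characters. The class and helper names below are made up for this sketch, and it assumes your JRE ships the extended charsets.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class Ms932VsShiftJis {
    // Whether the named charset has a mapping for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // Whether both charsets encode the text to identical bytes.
    static boolean sameBytes(String text, String charsetA, String charsetB) {
        byte[] a = text.getBytes(Charset.forName(charsetA));
        byte[] b = text.getBytes(Charset.forName(charsetB));
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        char circledOne = '\u2460'; // CIRCLED DIGIT ONE, a vendor extension character
        String katakana = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";

        System.out.println("Shift_JIS can encode U+2460: " + canEncode("Shift_JIS", circledOne));
        System.out.println("MS932 can encode U+2460: " + canEncode("MS932", circledOne));
        System.out.println("Katakana bytes identical: " + sameBytes(katakana, "Shift_JIS", "MS932"));
    }
}
```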