Variables used:
- JavaSE-6
- No frameworks
Given the input string ピーター・ジョーンズ,
which is encoded in UTF-8, I am having trouble converting it to Shift-JIS without writing the data to a file.
- Input (UTF-8 encoding):
ピーター・ジョーンズ
- Output (SHIFT-JIS encoding):
ピーター・ジョーンズ
(SHIFT-JIS to be encoded)
I've tried these code snippets to convert UTF-8 strings to SHIFT-JIS:
stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")
Both code snippets return this string output: �s�[�^�[�E�W���[���Y
(SHIFT-JIS encoded)
Any ideas on how this can be resolved?
Internally in Java, Strings are implemented as an array of UTF-16 code units. But this is an implementation detail; it would be possible to implement a JVM that uses a different encoding internally.
(Note "encoding", "charset" and Charset are more or less synonyms.)
A String should be treated as a sequence of Unicode codepoints (even though in Java it's a sequence of UTF-16 code units).
If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.
What you can have is a sequence of bytes that decode to a String if you decode it using an encoding, such as UTF-8 or Shift-JIS.
Or you can have a String that encodes to a sequence of bytes if you encode it using an encoding, such as UTF-8 or Shift-JIS.
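In short, an encoding or Charset is a pair of functions, "encode" and "decode", such that encode turns a String into a sequence of bytes and decode turns a sequence of bytes back into a String.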
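A minimal sketch of that pair of functions, using only Java SE 6 APIs. The class name CharsetPair and its two helper methods are hypothetical names chosen for illustration, and the sketch assumes the JRE ships the Shift_JIS charset (standard JDK builds do). For text the charset can represent, decoding the encoded bytes should give back the original String.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named here only for illustration.
class CharsetPair {
    // "encode": String -> bytes in the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // "decode": bytes in the given charset -> String.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The name from the question, written with Unicode escapes so the
        // example does not depend on the encoding of the source file.
        String name = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30F3\u30BA";
        Charset shiftJis = Charset.forName("Shift_JIS");

        byte[] encoded = encode(name, shiftJis);
        String decoded = decode(encoded, shiftJis);

        // Round trip: decoding what we encoded yields the same String.
        System.out.println(decoded.equals(name));
    }
}
```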
So if you have a byte[] that's encoded using UTF-8, you can create a String from those bytes by decoding them as UTF-8.
Then you can encode that String as Shift-JIS, which gives you a new byte[].
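The steps above can be sketched as follows, entirely in memory and without touching a file. This is only an illustration under a few assumptions: the class name Utf8ToShiftJis and the methods convert and toHex are hypothetical, and Charset.forName is used instead of StandardCharsets because the question targets Java SE 6.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named here only for illustration.
class Utf8ToShiftJis {
    private static final Charset UTF_8 = Charset.forName("UTF-8");
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Decode the UTF-8 bytes to a String, then encode that String as Shift-JIS.
    static byte[] convert(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, UTF_8);
        return text.getBytes(SHIFT_JIS);
    }

    // Render bytes as space-separated hex so they can be inspected safely
    // on any console, whatever its encoding.
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        // The name from the question, as Unicode escapes.
        String name = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30F3\u30BA";
        byte[] utf8Bytes = name.getBytes(UTF_8);

        byte[] sjisBytes = convert(utf8Bytes);

        System.out.println(utf8Bytes.length + " UTF-8 bytes -> "
                + sjisBytes.length + " Shift-JIS bytes");
        System.out.println(toHex(sjisBytes));
    }
}
```

Printing the bytes as hex rather than as text avoids a second, unrelated encoding step when checking the result.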
Since those bytes represent a string encoded using Shift-JIS, trying to decode them using UTF-8 will produce garbage.
Further, remember that if you print a string to an output, for example System.out, that will use the system default encoding, which is system dependent, to convert the String to bytes. It looks like your system default is UTF-8.
Then if your output is for example the Windows console, it will then convert those bytes to a String using very probably a completely different encoding (probably CP437 or CP850) before presenting it to you.
This last part might be tripping you up.
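Both pitfalls can be reproduced in a small sketch. The class EncodingPitfalls and its method names are hypothetical; the PrintStream here writes to an in-memory buffer with an explicitly chosen encoding, standing in for System.out with its system default, so the bytes it produces can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper class, named here only for illustration.
class EncodingPitfalls {
    // Pitfall 1: encode as Shift-JIS, then decode the same bytes as UTF-8.
    // Bytes that are not valid UTF-8 are replaced with U+FFFD.
    static String decodeShiftJisBytesAsUtf8(String text) {
        byte[] sjisBytes = text.getBytes(Charset.forName("Shift_JIS"));
        return new String(sjisBytes, Charset.forName("UTF-8"));
    }

    // Pitfall 2: printing a String converts it to bytes. Here the encoding
    // is passed explicitly instead of relying on the system default.
    static byte[] bytesWrittenByPrint(String text, String encodingName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, encodingName);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The name from the question, as Unicode escapes.
        String name = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30F3\u30BA";

        String garbled = decodeShiftJisBytesAsUtf8(name);
        System.out.println("contains U+FFFD: " + (garbled.indexOf('\uFFFD') >= 0));

        byte[] printed = bytesWrittenByPrint(name, "Shift_JIS");
        byte[] expected = name.getBytes(Charset.forName("Shift_JIS"));
        System.out.println("print produced Shift-JIS bytes: "
                + Arrays.equals(printed, expected));
    }
}
```

Whether those printed bytes then look right on screen depends on how the terminal decodes them, which Java does not control.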
Using "MS932" as the charset name instead of Shift-JIS/SJIS may do the trick.
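A rough way to check whether the choice matters for your data, assuming the JRE provides both charsets: as far as I know, MS932 is Java's alias for Microsoft's Shift-JIS variant (windows-31j), which adds vendor extension characters such as circled digits. The class Ms932VersusShiftJis and its methods are hypothetical names for this sketch.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper class, named here only for illustration.
class Ms932VersusShiftJis {
    // True if the named charset has a mapping for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // True if both charsets produce exactly the same bytes for the text.
    static boolean encodesIdentically(String text) {
        byte[] sjis = text.getBytes(Charset.forName("Shift_JIS"));
        byte[] ms932 = text.getBytes(Charset.forName("MS932"));
        return Arrays.equals(sjis, ms932);
    }

    public static void main(String[] args) {
        // The name from the question, as Unicode escapes (plain katakana).
        String name = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30F3\u30BA";
        System.out.println("same bytes for the name: " + encodesIdentically(name));

        // U+2460 CIRCLED DIGIT ONE is a typical vendor extension character.
        char circledOne = '\u2460';
        System.out.println("Shift_JIS can encode U+2460: "
                + canEncode("Shift_JIS", circledOne));
        System.out.println("MS932 can encode U+2460: "
                + canEncode("MS932", circledOne));
    }
}
```

For plain katakana like the name in the question the two should agree, so switching to MS932 mainly helps when the text contains characters outside the base Shift-JIS set.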