I have the following code in a jar file to convert from UTF-8 to ISO-8859-1. When I execute the jar on Windows I get one result, and on CentOS I get another. Might anyone know why?
public static void main(String[] args) {
    try {
        String x = "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »";
        Charset utf8charset = Charset.forName("UTF-8");
        Charset iso88591charset = Charset.forName("ISO-8859-1");
        ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes());
        CharBuffer data = utf8charset.decode(inputBuffer);
        ByteBuffer outputBuffer = iso88591charset.encode(data);
        byte[] outputData = outputBuffer.array();
        String z = new String(outputData);
        System.out.println(z);
    }
    catch(Exception e) {
        System.out.println(e.getMessage());
    }
}
In Windows, java -jar test.jar > test.txt creates a file containing: Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »
but in CentOS I get: �?, ä, �?, é, �?, ö, �?, ü, �?, «, »
Three possibilities spring to mind:
x.getBytes() may differ by platform
It's not clear in what way you're trying to convert from UTF-8 to ISO-8859-1, because your original data is actually just a String. You're treating the result of calling x.getBytes() as if it were UTF-8-encoded data, but it's just whatever the platform default encoding produces.
Likewise, when you write new String(outputData), that's using the platform default encoding. Don't do that.
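To illustrate why the no-argument calls are fragile, here is a small sketch that encodes one string under two charsets. The class name DefaultCharsetDemo and the sample text are assumptions made for illustration; the two explicit charsets merely stand in for differing platform defaults, and Charset.defaultCharset() reports the real default on a given machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {
    // Sample text "Ä, ä, ß", written with Unicode escapes so the result does
    // not depend on the encoding javac assumes for the source file.
    static final String TEXT = "\u00C4, \u00E4, \u00DF";

    // Number of bytes TEXT occupies when encoded with the given charset.
    static int encodedLength(Charset charset) {
        return TEXT.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // The charset that getBytes() and new String(byte[]) silently use here.
        System.out.println("default: " + Charset.defaultCharset());
        System.out.println("UTF-8: " + encodedLength(StandardCharsets.UTF_8) + " bytes");
        System.out.println("ISO-8859-1: " + encodedLength(StandardCharsets.ISO_8859_1) + " bytes");
    }
}
```

The same seven characters occupy 10 bytes in UTF-8 and 7 bytes in ISO-8859-1, so bytes produced under one default cannot safely be decoded under another.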
You don't need the byte buffer stuff at all: just encode using text.getBytes(encoding) and decode using new String(data, encoding).
You should first and foremost get the string into the correct internal representation in Java before even thinking about output, i.e. it should be that:
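A minimal sketch of such a check, assuming the intended text is the eleven characters from the question. The class name InternalRepresentationCheck and its helper are hypothetical. The expected string is spelled with Unicode escapes so that the comparison depends neither on the source file encoding nor on the console.

```java
import java.nio.charset.StandardCharsets;

final class InternalRepresentationCheck {
    // The eleven characters from the question, separated by ", " and
    // spelled with Unicode escapes (U+00C4 is A with diaeresis, and so on).
    static final String EXPECTED =
        "\u00C4, \u00E4, \u00C9, \u00E9, \u00D6, \u00F6, \u00DC, \u00FC, \u00DF, \u00AB, \u00BB";

    // Decodes bytes that are known to be UTF-8 into a Java String.
    static String decodeUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for real input: the UTF-8 encoding of the expected text.
        byte[] input = EXPECTED.getBytes(StandardCharsets.UTF_8);
        String x = decodeUtf8(input);
        // Only ASCII is printed, so the console encoding cannot distort the result.
        System.out.println(x.equals(EXPECTED));
    }
}
```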
The above can be verified without any output encoding issues, because it simply prints true or false.
In Windows you already achieved this with
Because all you need to go from "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »" to "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »" is:
After this you do something with ISO-8859-1, which is futile because barely half the characters in your original string can be represented in ISO-8859-1, not to mention that you are already done as per the above. You can delete the code after utf8charset.decode(inputBuffer).
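Whether a given character survives conversion to ISO-8859-1 can be checked with CharsetEncoder.canEncode. The sketch below is illustrative only: the class name Latin1Coverage and the two sample characters are assumptions, not taken from the question. It also shows that String.getBytes substitutes '?' for a character the target charset cannot represent, which is one way question marks can end up in output.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Coverage {
    // True if the character has a representation in ISO-8859-1.
    static boolean fitsLatin1(char c) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
    }

    // Encodes to ISO-8859-1; getBytes replaces unmappable characters with '?'.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00C4 (A with diaeresis) is part of ISO-8859-1.
        System.out.println(fitsLatin1('\u00C4')); // true
        // U+201E (double low-9 quotation mark) is not.
        System.out.println(fitsLatin1('\u201E')); // false
        // The unmappable character is silently replaced.
        System.out.println((char) toLatin1("\u201E")[0]); // ?
    }
}
```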
So now your code could look like:
These two lines, ByteBuffer.wrap(x.getBytes()) and new String(outputData), are platform and default encoding specific.
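A sketch of how the platform-dependent calls in the question, x.getBytes() and new String(outputData), might be made independent of the platform by naming the charset at every boundary between bytes and characters. The class name PortableTranscode and its helper are hypothetical, the input is shortened and spelled with Unicode escapes to sidestep the source file encoding, and writing the output as UTF-8 is an assumption about what the reader of test.txt expects.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

final class PortableTranscode {
    // Decodes UTF-8 input bytes, then re-encodes the text as ISO-8859-1.
    static byte[] utf8ToLatin1(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        // Sample text: A with diaeresis, a with diaeresis, sharp s.
        String x = "\u00C4, \u00E4, \u00DF";

        // Explicit charset instead of x.getBytes().
        byte[] utf8 = x.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = utf8ToLatin1(utf8);

        // Explicit charset instead of new String(outputData).
        String z = new String(latin1, StandardCharsets.ISO_8859_1);

        // System.out encodes with the platform default, so wrap it with an
        // explicitly chosen output encoding (UTF-8 here, by assumption).
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(z);
    }
}
```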
This runs as expected on Windows and Linux by avoiding platform-specific conversions.
prints