I have the following code in a jar file to convert from UTF-8 to ISO-8859-1. When I run the jar on Windows I get one result, and on CentOS I get another. Does anyone know why?
public static void main(String[] args) {
try {
String x = "Ã„, Ã¤, Ã‰, Ã©, Ã–, Ã¶, Ãœ, Ã¼, ÃŸ, Â«, Â»";
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes());
CharBuffer data = utf8charset.decode(inputBuffer);
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();
String z = new String(outputData);
System.out.println(z);
}
catch(Exception e) {
System.out.println(e.getMessage());
}
}
In Windows, java -jar test.jar > test.txt creates a file containing:
Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »
but in CentOS I get:
�?, ä, �?, é, �?, ö, �?, ü, �?, «, »
These two calls
x.getBytes()
new String(outputData)
both use the platform default encoding, so their results depend on the machine the code runs on.
This runs as expected on Windows and Linux because it avoids platform-specific conversions.
String x = "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »";
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes(utf8charset));
CharBuffer data = utf8charset.decode(inputBuffer);
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();
String z = new String(outputData, iso88591charset);
System.out.println(z);
prints
Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »
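To make the platform dependence concrete, here is a minimal sketch that encodes the same character under two charsets that are typical defaults (Windows-1252 on many Windows installs, UTF-8 on many Linux installs). The class name DefaultCharsetDemo and the helper encodeAs are made up for illustration, and the character is written as a Unicode escape so the source-file encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetDemo {
    // Encodes text the way the no-argument getBytes() would
    // if assumedDefault were the platform default charset.
    static byte[] encodeAs(String text, Charset assumedDefault) {
        return text.getBytes(assumedDefault);
    }

    public static void main(String[] args) {
        String text = "\u00C4"; // A with diaeresis, as an escape

        byte[] windowsStyle = encodeAs(text, Charset.forName("windows-1252"));
        byte[] linuxStyle = encodeAs(text, StandardCharsets.UTF_8);

        // What the no-argument calls would actually use on this machine.
        System.out.println("default here: " + Charset.defaultCharset());
        System.out.println(Arrays.toString(windowsStyle)); // [-60]
        System.out.println(Arrays.toString(linuxStyle));   // [-61, -124]
    }
}
```

The same String yields one byte under one default and two bytes under the other, which is why passing the charset explicitly removes the difference.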
You should first and foremost get the string into the correct internal representation in Java before even thinking about output. That is, the following should hold:
z.equals("Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »") == true
The above can be verified without any output encoding issues, because it simply prints true or false.
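If you want to see more than true or false, one option is to print the string as ASCII-only escapes, so the console encoding cannot distort what you read. This is a sketch only; EscapeDump and escapeNonAscii are hypothetical names, not part of any library.

```java
class EscapeDump {
    // Returns s with every non-ASCII char replaced by a
    // backslash-u escape with four hex digits; the result is pure ASCII.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Input written with escapes so the source encoding is irrelevant.
        System.out.println(escapeNonAscii("\u00C4, \u00E4"));
    }
}
```

Comparing the escaped form of z with the escaped form of the expected literal shows exactly which characters differ, independent of the terminal or the redirected file.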
On Windows you already achieved this with
ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes());
CharBuffer data = utf8charset.decode(inputBuffer);
Because all you need to go from "Ã„, Ã¤, Ã‰, Ã©, Ã–, Ã¶, Ãœ, Ã¼, ÃŸ, Â«, Â»"
to "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »"
is:
ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes(windows1252)); // explicit windows1252 works on CentOS too
CharBuffer data = utf8charset.decode(inputBuffer);
After this you do something with ISO-8859-1, which is futile because barely half the characters in your original string can be represented in ISO-8859-1, not to mention that you are already done as per the above. You can delete the code after utf8charset.decode(inputBuffer).
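To check which characters ISO-8859-1 cannot hold, you can ask the encoder directly. The sketch below assumes the original string is the Windows-1252 misreading of the UTF-8 bytes (written here with Unicode escapes); Latin1Check and unencodable are names invented for this example.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

class Latin1Check {
    // Lists the code points of the chars in s that ISO-8859-1 cannot encode.
    static String unencodable(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringJoiner result = new StringJoiner(" ");
        for (char c : s.toCharArray()) {
            if (!encoder.canEncode(c)) {
                result.add(String.format("U+%04X", (int) c));
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Assumed original literal: each target character's two UTF-8 bytes
        // decoded as Windows-1252, spelled out with escapes.
        String assumedOriginal =
            "\u00C3\u201E, \u00C3\u00A4, \u00C3\u2030, \u00C3\u00A9, "
            + "\u00C3\u2013, \u00C3\u00B6, \u00C3\u0153, \u00C3\u00BC, "
            + "\u00C3\u0178, \u00C2\u00AB, \u00C2\u00BB";
        // Prints: U+201E U+2030 U+2013 U+0153 U+0178
        System.out.println(unencodable(assumedOriginal));
    }
}
```

Under that assumption, the five reported code points are the second halves of the pairs for Ä, É, Ö, Ü and ß, which are the positions that come out damaged on CentOS.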
So now your code could look like:
String x = "Ã„, Ã¤, Ã‰, Ã©, Ã–, Ã¶, Ãœ, Ã¼, ÃŸ, Â«, Â»";
Charset windows1252 = Charset.forName("Windows-1252");
Charset utf8charset = Charset.forName("UTF-8");
byte[] bytes = x.getBytes(windows1252);
String z = new String(bytes, utf8charset);
//Still wondering why you didn't just have this literal to begin with
//Check that the strings are internally equal so you know at least that
//the code is working
System.out.println(z.equals( "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »"));
System.out.println(z);
Three possibilities spring to mind:
- The encoding you're actually using for your source code may differ by platform
- The encoding the compiler expects by default may differ by platform (you can specify it on the command line)
- The platform default encoding used when you call x.getBytes() may differ by platform
It's not clear in what way you're trying to convert from UTF-8 to ISO-8859-1, because your original data is actually just a String. You're treating the result of calling x.getBytes() as if it were UTF-8-encoded data, but it's just whatever the platform default is...
Likewise when you write:
String z = new String(outputData);
... that's using the platform default encoding. Don't do that.
You don't need the byte buffer stuff at all: just encode using text.getBytes(encoding) and decode using new String(data, encoding).
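As a rough sketch of that approach, converting genuinely UTF-8-encoded bytes to ISO-8859-1 bytes takes one decode and one encode, both with explicit charsets. Transcode and utf8ToLatin1 are illustrative names only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Transcode {
    // Decodes the input as UTF-8, then re-encodes the text as ISO-8859-1.
    // Characters ISO-8859-1 cannot represent are replaced with '?'.
    static byte[] utf8ToLatin1(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xC3 0x84 is the UTF-8 encoding of U+00C4 (A with diaeresis).
        byte[] utf8 = {(byte) 0xC3, (byte) 0x84};
        System.out.println(Arrays.toString(utf8ToLatin1(utf8))); // [-60]
    }
}
```

Because no call relies on the platform default, this behaves the same on Windows and CentOS.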