We ran some Java code from cron on Linux to persist thousands of records to a production database. The locale charmap on that box was "ANSI_X3.4-1968" (that is, US-ASCII). We took the following steps before persisting the records to the database.
1. Use StringEscapeUtils.unescapeHtml4 on the text
2. Write the String in UTF-8 format and persist it in the database
Now the problem is that, after these steps, special characters show up as "?". Is it possible to revert them to the original characters?
I have simulated the problem with the following steps.
- Change the Eclipse encoding to "ANSI_X3.4-1968"
- Write the following lines of code
String insertSpecial = StringEscapeUtils.unescapeHtml4("×");
System.out.println(insertSpecial);
String uni = new String(insertSpecial.getBytes(), "UTF-8");// This value is currently in DB
System.out.println(uni);
Now I want to get back "×" from the String "uni". Any help will be appreciated.
Basically no. Your biggest mistake is new String(insertSpecial.getBytes(), "UTF-8"), which again shows that character encoding is surprisingly difficult to get right.
What that piece of code does, step by step:
- Give me the bytes of insertSpecial in the platform encoding
- Create a new String from those bytes, claiming that they are UTF-8 (even though they were produced in the platform encoding just a moment earlier)
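To make those two steps concrete, here is a minimal sketch. It assumes the platform encoding is US-ASCII (as on the box in the question) and passes that charset explicitly so the result does not depend on the machine it runs on. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    // Mimics new String(s.getBytes(), "UTF-8") on a machine whose
    // platform encoding is US-ASCII (assumption: stands in for getBytes()).
    static String roundTrip(String s) {
        // Step 1: encode. Characters ASCII cannot represent become '?' (0x3F).
        byte[] asciiBytes = s.getBytes(StandardCharsets.US_ASCII);
        // Step 2: decode those bytes as UTF-8. The '?' byte stays a '?'.
        return new String(asciiBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00D7"; // the multiplication sign
        byte[] asciiBytes = original.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(asciiBytes)); // prints [63]
        System.out.println(roundTrip(original));         // prints ?
    }
}
```

The damage happens in step 1: the only byte left is 63, the code for a literal question mark, so step 2 has nothing to recover.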
I've seen this code several times, and unfortunately it only breaks things. It's completely unnecessary, and it wouldn't "convert" anything even if it were written correctly: a String is already a sequence of Unicode characters, and an encoding only comes into play when you turn it into bytes or read it back from bytes. If the platform encoding is not UTF-8, it will most likely destroy any special characters (or even the whole String, if the platform encoding and the one given to the String constructor differ enough).
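In your case the platform encoding was ASCII, which cannot represent "×", so getBytes() replaced it with "?". The question mark is a placeholder for a character that could not be converted, meaning the original is gone for good.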
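A short sketch of why the loss cannot be undone, and of what the code could do instead. As before, US-ASCII is passed explicitly as a stand-in for the platform encoding, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBoundary {
    // The broken conversion, with US-ASCII standing in for the platform encoding.
    static String lossy(String s) {
        return new String(s.getBytes(StandardCharsets.US_ASCII), StandardCharsets.UTF_8);
    }

    // The safe alternative: leave the String alone and name the charset
    // explicitly only where bytes are actually produced.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String times = "\u00D7";  // multiplication sign
        String eAcute = "\u00E9"; // e with acute accent

        // Two different characters collapse into the same "?", so nothing
        // in the stored value says which one it used to be.
        System.out.println(lossy(times).equals(lossy(eAcute))); // prints true

        // An explicit UTF-8 round trip keeps the character intact.
        String back = new String(toUtf8(times), StandardCharsets.UTF_8);
        System.out.println(back.equals(times)); // prints true
    }
}
```

With a database, there is usually no need to produce bytes by hand at all: pass the String to the driver and make sure the connection and the column use a Unicode-capable encoding. The rows that were already written as "?" would have to be reloaded from the original source data.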
Here's some reading so you won't make that mistake again: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)