Avoid printing Unicode replacement character in Java

Posted 2019-09-12 01:15

Question:

In Java, why does Character.toString((char) 65533) print out this symbol: � ?

I have a Java program which prints these characters all over the place. It's a big program. Any ideas on what I can do to avoid this?

Answer 1:

One of the most likely scenarios is that you are reading ISO-8859 data using the UTF-8 character set. Whenever the decoder comes across a sequence of bytes that is not valid UTF-8, it replaces that sequence with the � symbol.

Check your input streams, and ensure that you read them using the correct character set.
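To illustrate the mismatch, here is a minimal sketch that decodes the same bytes with two different charsets. The class and method names are made up for the example, and the sample text "café" is an assumption, not something from the question.

```java
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    // Decodes the bytes as UTF-8; malformed sequences become U+FFFD.
    static String decodeAsUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Decodes the bytes as ISO-8859-1; every byte maps to exactly one char.
    static String decodeAsLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the é is the single byte 0xE9,
        // which on its own is not a valid UTF-8 sequence.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println((int) decodeAsUtf8(latin1).charAt(3));   // 65533
        System.out.println((int) decodeAsLatin1(latin1).charAt(3)); // 233
    }
}
```

The wrong charset silently yields 65533 (U+FFFD), while the right one yields the expected é (233). No exception is thrown in either case, which is why the problem tends to show up far from where it was caused.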



Answer 2:

In Java, why does Character.toString((char) 65533) print out this symbol: � ?

Because exactly this character is assigned to that code point. Java is not displaying a random character, as you seem to think.
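A quick way to confirm this is to ask Java for the code point's official name. This is only a small sketch; the class and method names are invented for the example.

```java
final class ReplacementCharInfo {
    // Formats a char as "U+XXXX NAME" using the JDK's Unicode name table.
    static String describe(char c) {
        return String.format("U+%04X %s", (int) c, Character.getName(c));
    }

    public static void main(String[] args) {
        // 65533 decimal is 0xFFFD hexadecimal.
        System.out.println(describe((char) 65533)); // U+FFFD REPLACEMENT CHARACTER
    }
}
```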

I have a Java program which prints these characters all over the place. It's a big program. Any ideas on what I can do to avoid this?

Your problem lies somewhere else. At the very least, it boils down to this: every step that involves a byte-to-char conversion (storing text in a file or database, reading text from a file or database, manipulating text, transferring text, displaying text, etc.) should be set to use UTF-8.
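For the file case, a minimal sketch of what "set every step to UTF-8" can look like is shown below. The temp-file setup and the names are assumptions for illustration only; the point is that both the writer and the reader name the charset explicitly instead of relying on the platform default.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitUtf8RoundTrip {
    // Writes one line of text to a temp file and reads it back,
    // naming UTF-8 explicitly on both sides of the conversion.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            try {
                try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    out.write(text);
                }
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    return in.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("caf\u00E9 \u20AC").equals("caf\u00E9 \u20AC")); // true
    }
}
```

The same idea applies to other byte-to-char boundaries, for example passing a charset to InputStreamReader, OutputStreamWriter, or String.getBytes.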

What catches my eye is that Java does nothing special with 0xFFFD when writing output: it just replaces characters that the target charset cannot represent with a question mark (?), and yet you keep insisting that the 0xFFFD comes from Java. I know that Firefox does exactly what you describe, so are you maybe confusing "Firefox" with "Java"?
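The question-mark behaviour on the encoding side can be seen with a small sketch. US-ASCII is chosen here only as an example of a charset that cannot represent the characters; the names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class UnmappableEncodeDemo {
    // Encodes text to US-ASCII and decodes it again. String.getBytes(Charset)
    // substitutes the charset's replacement byte ('?') for unmappable chars.
    static String encodeToAsciiAndBack(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(encodeToAsciiAndBack("caf\u00E9")); // caf?
        System.out.println(encodeToAsciiAndBack("\uFFFD"));    // ?
    }
}
```

So a plain "?" in the output usually points at the encoding (char-to-byte) step, whereas a � that is already in the string usually points at an earlier decoding (byte-to-char) step.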

If this is true and you're actually talking about a Java web application, then you need to set at least the HTTP response encoding to UTF-8. You can do that by putting <%@ page pageEncoding="UTF-8" %> at the top of the JSP page in question. You may find this article useful for more background information and a detailed overview of all the steps and solutions you need to apply to solve this "Unicode problem".



Answer 3:

U+FFFD is not an ordinary text character; it is the Unicode replacement character. Hence, the code is logically incorrect. The intended use of the replacement character is to be substituted for bad input, not to be created on purpose with something like (char) 65533.

How to fix it: don't put junk in strings. Strings are for text. Bytes are for random binary data.
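Here is a short sketch of why binary data does not belong in a String. The sample bytes are arbitrary, the names are invented, and Base64 is shown only as one possible option when binary data really has to travel as text; it is not something the original answer prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class BinaryInStringDemo {
    // Pushes raw bytes through a UTF-8 String and back. Invalid sequences
    // are turned into U+FFFD on the way in, so the bytes may not survive.
    static boolean survivesStringRoundTrip(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return Arrays.equals(data, s.getBytes(StandardCharsets.UTF_8));
    }

    // Base64 maps arbitrary bytes to ASCII text and back without loss.
    static boolean survivesBase64RoundTrip(byte[] data) {
        String s = Base64.getEncoder().encodeToString(data);
        return Arrays.equals(data, Base64.getDecoder().decode(s));
    }

    public static void main(String[] args) {
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};
        System.out.println(survivesStringRoundTrip(binary)); // false
        System.out.println(survivesBase64RoundTrip(binary)); // true
    }
}
```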



Answer 4:

Well, what do you want it to do? If you're getting these characters "all over the place" I suspect you have bad data... it should be pretty rare that you receive data which can't be represented in Unicode.

How are you getting the data to start with?
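If you are not sure where the bad data enters, one possible approach is to decode strictly at each input boundary, so that invalid bytes raise an error instead of quietly becoming �. The sketch below assumes the input is supposed to be UTF-8; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    // Returns true only if the bytes are well-formed UTF-8. With REPORT,
    // the decoder throws instead of substituting U+FFFD.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
    }
}
```

Note that bytes which already contain an encoded U+FFFD (EF BF BD) are valid UTF-8 and will pass this check; in that case the damage happened further upstream, before the data reached you.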



Answer 5:

Have a look at this primer on character encodings.