How to replace/remove 4(+)-byte characters from a

2019-03-14 16:15发布

问题:

Because MySQL 5.1 does not support 4 byte UTF-8 sequences, I need to replace/drop the 4 byte sequences in these strings.

I'm looking a clean way to replace these characters.

Apache libraries are replacing the characters with a question-mark is fine for this case, although ASCII equivalent would be nicer, of course.

N.B. The input is from external sources (e-mail names) and upgrading the database is not a solution at this point in time.

回答1:

We ended up implementing the following method in Java for this problem. Basicaly replacing the characters with a higher codepoint then the last 3byte UTF-8 char.

The offset calculations are to make sure we stay on the unicode code points.

public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
public static final String REPLACEMENT_CHAR = "\uFFFD"; 

public static String toValid3ByteUTF8String(String s)  {
    final int length = s.length();
    StringBuilder b = new StringBuilder(length);
    for (int offset = 0; offset < length; ) {
       final int codepoint = s.codePointAt(offset);

       // do something with the codepoint
       if (codepoint > CharUtils.LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
           b.append(CharUtils.REPLACEMENT_CHAR);
       } else {
           if (Character.isValidCodePoint(codepoint)) {
               b.appendCodePoint(codepoint);
           } else {
               b.append(CharUtils.REPLACEMENT_CHAR);
           }
       }
       offset += Character.charCount(codepoint);
    }
    return b.toString();
}


回答2:

Another simple solution is to use regular expression [^\u0000-\uFFFF]. For example in java:

text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");


回答3:

5 byte utf-8 sequences begin with a 111110xx-byte and 6 byte utf-8 sequences begin with a 1111110x-byte. Important to note is, that no follow-up bytes of 1-4-byte utf-8 sequences contain bytes that large because follow-up bytes are always of the form 10xxxxxx.

Therefore you can just go through the bytes and every time you see a byte of kind 111110xx then only emit a '?' to the output-stream/array while skipping the next 4 bytes from the input; analogue for the 6-byte-sequences.



标签: java mysql utf-8