How to replace/remove 4(+)-byte characters from a

Because MySQL 5.1 does not support 4 byte UTF-8 sequences, I need to replace/drop the 4 byte sequences in these strings.

I'm looking for a clean way to replace these characters.

Replacing the characters with a question mark, as the Apache libraries do, is fine for this case, although an ASCII equivalent would be nicer, of course.

N.B. The input is from external sources (e-mail names) and upgrading the database is not a solution at this point in time.

Tags: java mysql utf-8

3 answers

聊天终结者

#2 · 2019-03-14 16:51

We ended up implementing the following method in Java for this problem. Basically, it replaces every character whose code point is higher than the last 3-byte UTF-8 character (U+FFFF).

The offset calculations make sure we always stay on Unicode code point boundaries.

public final class CharUtils {
    // U+FFFF is the highest code point that UTF-8 encodes in three bytes.
    public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
    // U+FFFD, the Unicode replacement character (itself three bytes in UTF-8).
    public static final String REPLACEMENT_CHAR = "\uFFFD";

    public static String toValid3ByteUTF8String(String s) {
        final int length = s.length();
        StringBuilder b = new StringBuilder(length);
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);

            if (codepoint > LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
                // Would need four bytes in UTF-8, so substitute it.
                b.append(REPLACEMENT_CHAR);
            } else if (Character.isValidCodePoint(codepoint)) {
                b.appendCodePoint(codepoint);
            } else {
                b.append(REPLACEMENT_CHAR);
            }
            // Advance by one or two chars so the next read starts on a code point.
            offset += Character.charCount(codepoint);
        }
        return b.toString();
    }
}
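To see why the offset is advanced by `Character.charCount(codepoint)` rather than by 1, here is a minimal sketch that only walks a string by code point. The class and method names (`CodePointWalk`, `codePoints`) are made up for illustration and are not part of the answer's code.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    // Collects the code points of s, stepping one or two chars at a time.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int offset = 0; offset < s.length(); ) {
            int cp = s.codePointAt(offset);
            out.add(cp);
            // charCount is 2 for code points above U+FFFF (a surrogate pair), else 1.
            offset += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());           // 4 (chars)
        System.out.println(codePoints(s).size()); // 3 (code points)
    }
}
```

Stepping by 1 instead would land on the low surrogate of U+1F600 and treat it as a separate character.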


Bombasti

#3 · 2019-03-14 16:51

5-byte UTF-8 sequences begin with a byte of the form 111110xx, and 6-byte UTF-8 sequences begin with a byte of the form 1111110x. It is important to note that none of the follow-up bytes of 1- to 4-byte UTF-8 sequences are that large, because follow-up bytes always have the form 10xxxxxx.

Therefore you can just walk through the bytes, and every time you see a byte of the form 111110xx, emit a single '?' to the output stream/array and skip the next 4 bytes of the input; 6-byte sequences are handled analogously.
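The byte scan described above could look roughly like the sketch below. Two things here are my assumptions rather than part of the answer: the input is taken to be well-formed UTF-8, and the check is extended to 4-byte sequences (lead byte 11110xxx, followed by three continuation bytes), since those are what the question is about. The names `Utf8ByteFilter`, `sequenceLength` and `replaceLongSequences` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class Utf8ByteFilter {
    // Length of the sequence announced by lead byte b (0..255), read from its high bits.
    static int sequenceLength(int b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((b & 0xFC) == 0xF8) return 5; // 111110xx
        if ((b & 0xFE) == 0xFC) return 6; // 1111110x
        return 1; // stray continuation byte or 0xFE/0xFF: pass through as one byte
    }

    // Copies 1- to 3-byte sequences unchanged; emits one '?' per longer sequence.
    static byte[] replaceLongSequences(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        int i = 0;
        while (i < in.length) {
            int len = sequenceLength(in[i] & 0xFF);
            if (len >= 4) {
                out.write('?');
            } else {
                int end = Math.min(i + len, in.length); // guard against truncated input
                out.write(in, i, end - i);
            }
            i += len; // skip the lead byte and its continuation bytes
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] in = "a\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(replaceLongSequences(in), StandardCharsets.UTF_8)); // a?b
    }
}
```

Note that a Java `String` encoded with `StandardCharsets.UTF_8` never yields 5- or 6-byte sequences, so in practice only the 4-byte branch fires; the longer forms only matter for raw bytes from elsewhere.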


一纸荒年 Trace。

#4 · 2019-03-14 17:01

Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:

text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
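If this runs on many strings, the same expression can be compiled once and reused. A minimal sketch, with the class and method names (`BmpOnly`, `replaceNonBmp`) made up for illustration:

```java
import java.util.regex.Pattern;

final class BmpOnly {
    // Matches any single code point above U+FFFF; Java regex matches by code point,
    // so a surrogate pair in the input counts as one character here.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    // Replaces each code point above U+FFFF with U+FFFD.
    static String replaceNonBmp(String text) {
        return NON_BMP.matcher(text).replaceAll("\uFFFD");
    }

    public static void main(String[] args) {
        String cleaned = replaceNonBmp("a\uD83D\uDE00b"); // 'a', U+1F600, 'b'
        System.out.println(cleaned.length()); // 3
    }
}
```

Characters that need three bytes or fewer in UTF-8, such as é or €, pass through unchanged.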
