Can PHP detect 4-byte encoded UTF-8 characters?

Posted 2019-01-31 13:23

I am using MySQL tables with the utf8 charset on a MySQL 5.1 server, which does not support the utf8mb4 encoding in tables. When inserting 4-byte encoded UTF-8 characters like "

Tags: php utf8mb4
2 answers
\"骚年 ilove
Reply #2 · 2019-01-31 14:08

The following regular expression will replace 4-byte UTF-8 characters:

function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);    
}

var_dump(replace4byte('d'), replace4byte('d                                                                    
老娘就宠你
Reply #3 · 2019-01-31 14:12

This should work:

if (max(array_map('ord', str_split($string))) >= 240) 

The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points take the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 (0xF0) or higher. In valid UTF-8 no other byte can reach that value, so any such byte in the string indicates a 4-byte sequence.
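The lead-byte reasoning can also be checked with a byte-mode regex instead of splitting the string into an array. This is only a minimal sketch: the helper name `hasFourByteChar` is made up for illustration, it assumes the input is already valid UTF-8, and the `"\u{...}"` escapes in the usage lines need PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not from the original answer).
// Assumes $string is valid UTF-8; in that case bytes 0xF0-0xF4
// can only appear as the lead byte of a 4-byte sequence.
function hasFourByteChar(string $string): bool
{
    // No /u modifier: the pattern is matched byte by byte.
    return preg_match('/[\xF0-\xF4]/', $string) === 1;
}

var_dump(hasFourByteChar("abc"));        // bool(false): ASCII only
var_dump(hasFourByteChar("\u{20AC}"));   // bool(false): U+20AC is E2 82 AC (3 bytes)
var_dump(hasFourByteChar("\u{1F600}"));  // bool(true):  U+1F600 is F0 9F 98 80 (4 bytes)
```

Unlike the `max(array_map(...))` form, this does not build an intermediate array, which may matter for long strings.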

If you want to remove long characters, this will do:

$string = preg_replace_callback('/./u', function (array $match) {
    // Drop any character whose UTF-8 encoding is 4 bytes long; keep the rest.
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string);

Though there may be a more elegant regex way to express high codepoints directly.
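One way to express the high code points directly is a character class over U+10000 to U+10FFFF with the `u` modifier. The following is a sketch under assumptions rather than a definitive answer: the function name `stripSupplementary` is hypothetical, and with the `u` modifier `preg_replace` returns null when the subject is not valid UTF-8, so the caller has to handle that case.

```php
<?php
// Hypothetical helper (not from the original answers).
// Replaces every code point above U+FFFF, i.e. every character that
// needs 4 bytes in UTF-8. Returns null if $string is not valid UTF-8.
function stripSupplementary(string $string, string $replacement = ''): ?string
{
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $string);
}

var_dump(stripSupplementary("a\u{1F600}b"));       // string(2) "ab"
var_dump(stripSupplementary("a\u{1F600}b", '?'));  // string(3) "a?b"
```

Passing U+FFFD (the replacement character, 3 bytes in UTF-8) as `$replacement` keeps a visible marker where a character was dropped while still fitting into a MySQL utf8 column.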
