I am using utf8-charset MySQL tables on a MySQL 5.1 server, which does not support the utf8mb4 encoding in tables. When inserting 4-byte encoded UTF-8 characters like "
The following regular expression will replace 4-byte UTF-8 characters:
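A minimal sketch of such a replacement, written in Python as an assumption. The function name `replace_four_byte_chars` and the default replacement U+FFFD are illustrative choices, not part of the original answer:

```python
import re

# Code points U+10000..U+10FFFF lie outside the Basic Multilingual Plane;
# these are exactly the characters that need four bytes in UTF-8.
FOUR_BYTE_CHARS = re.compile("[\U00010000-\U0010FFFF]")


def replace_four_byte_chars(text: str, replacement: str = "\uFFFD") -> str:
    """Replace every character that needs four UTF-8 bytes with `replacement`."""
    return FOUR_BYTE_CHARS.sub(replacement, text)
```

Passing an empty string as `replacement` removes the characters instead of substituting them.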
This should work:
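A minimal sketch of such a check, assuming Python. It reports whether any byte of the string's UTF-8 encoding has a value of 240 or higher; the name `has_four_byte_chars` is hypothetical:

```python
def has_four_byte_chars(text: str) -> bool:
    """Return True if the UTF-8 encoding of `text` contains a 4-byte sequence."""
    encoded = text.encode("utf-8")
    # In valid UTF-8 only the lead byte of a 4-byte sequence is 0xF0 (240)
    # or higher; lead bytes of shorter sequences and all continuation bytes
    # stay below that value.
    return any(byte >= 0xF0 for byte in encoded)
```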
The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest being of the form 1110xxxx 10xxxxxx 10xxxxxx. Higher code points are of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte, which is also the highest-valued byte of the sequence, has a value of 240 (0xF0) or higher. If there are any such bytes in the string, it's an indicator for a 4-byte sequence.

If you want to remove long characters, this will do:
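A byte-level sketch of the removal, assuming Python and input that is already UTF-8 encoded. The pattern matches a lead byte in the range 0xF0 to 0xF7 followed by three continuation bytes; the name `strip_four_byte_sequences` is hypothetical:

```python
import re

# One lead byte of the form 11110xxx (0xF0..0xF7) followed by exactly
# three continuation bytes of the form 10xxxxxx (0x80..0xBF).
FOUR_BYTE_SEQUENCE = re.compile(rb"[\xF0-\xF7][\x80-\xBF]{3}")


def strip_four_byte_sequences(data: bytes) -> bytes:
    """Remove every 4-byte UTF-8 sequence from the encoded bytes in `data`."""
    return FOUR_BYTE_SEQUENCE.sub(b"", data)
```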
Though there may be a more elegant regex way to express high codepoints directly.