Migrating a PHP application to handle UTF-8

I am working on a multi-language app in PHP.

All was fine until I was recently asked to support Chinese characters. These are the actions I took to support UTF-8:

All DB tables are now UTF-8
HTML templates contain the tag <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The controllers send out a header specifying the encoding (utf-8) to use for the http response

All was good until I started doing some string manipulation (substr and the like).

This doesn't work with Chinese, because each Chinese character is encoded as several bytes. A plain substring (substr) can cut a character in the middle of its byte sequence and garble the result on screen.

I fixed ALL my problems by adding this in the bootstrap

mb_internal_encoding("UTF-8");

and replacing all the strlen, substr, strstr with their mb_ counterparts.
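
To make the problem concrete, here is a minimal sketch of the difference between the byte-oriented functions and their mb_ counterparts. The sample string and variable names are only for illustration, and it assumes the mbstring extension is loaded.

```php
<?php
// Four Chinese characters; each one takes three bytes in UTF-8.
$text = "汉字测试";

$byteCount = strlen($text);              // counts bytes
$charCount = mb_strlen($text, 'UTF-8');  // counts characters

// Cutting after 4 bytes slices through the second character.
$brokenCut = substr($text, 0, 4);
// Cutting after 2 characters keeps every byte sequence intact.
$cleanCut = mb_substr($text, 0, 2, 'UTF-8');

// mb_check_encoding() reports whether a string is well-formed UTF-8.
$brokenIsValid = mb_check_encoding($brokenCut, 'UTF-8');
$cleanIsValid  = mb_check_encoding($cleanCut, 'UTF-8');
```

Passing the encoding explicitly, as done here, keeps the calls correct even if mb_internal_encoding has not been set yet.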

What other things do I need to do to support UTF-8 fully in PHP?

There's a little more to it than just replacing those functions.

Regular expressions

You should add the u (UTF-8) modifier to every PCRE regular expression whose pattern or subject can contain non-ASCII characters, so that both are interpreted as characters rather than bytes.

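$subject = "Helló";
$pattern = '/(l|ó){2,3}/u'; // The u modifier marks pattern and subject as UTF-8
// mb_substr cuts on character boundaries; the captured offsets are still in bytes
preg_match($pattern, mb_substr($subject, 3, null, 'UTF-8'), $matches, PREG_OFFSET_CAPTURE);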

You should also use the Unicode character classes rather than the standard Perl ones if you want your regular expressions to be correct for non-Latin alphabets:

\p{L} instead of \w for any 'letter' character.
\p{Z} instead of \s for any 'space' character.
\p{N} instead of \d for any 'digit' character, e.g. Arabic-Indic digits
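
As a rough illustration of these classes, the sketch below pulls letter runs and number runs out of a string that contains no ASCII letters or digits at all. The sample text and variable names are made up for the example.

```php
<?php
// Russian words followed by Arabic-Indic digits (sample data for illustration).
$text = "Привет мир ١٢٣";

// \p{L}+ matches runs of letters from any script.
preg_match_all('/\p{L}+/u', $text, $letterRuns);
// \p{N}+ matches runs of numeric characters from any script.
preg_match_all('/\p{N}+/u', $text, $numberRuns);

// An ASCII-only character class finds nothing in the same text.
$asciiMatches = preg_match_all('/[A-Za-z0-9]+/', $text);
```

Here $letterRuns[0] holds the two words and $numberRuns[0] holds the digit run, while the ASCII-only pattern matches nothing.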

There are a lot of different Unicode character classes, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example some characters combine with the previous character to make a new glyph. More explanation of them can be read here.

Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the u modifier.

Function replacements

Your list is a start, but the list of functions I have found so far that need to be replaced with multibyte versions is longer. Below are the functions with their replacements; some of the replacements are not defined in PHP itself, but are available from here on GitHub as mb_extra.

$unsafeFunctions = array(
    'mail'      => 'mb_send_mail',
    'split'     => null, //'mb_split', deprecated function - just don't use it
    'stripos'   => 'mb_stripos',
    'stristr'   => 'mb_stristr',
    'strlen'    => 'mb_strlen',
    'strpos'    => 'mb_strpos',
    'strrpos'   => 'mb_strrpos',
    'strrchr'   => 'mb_strrchr',
    'strripos'  => 'mb_strripos',
    'strstr'    => 'mb_strstr',
    'strtolower'    => 'mb_strtolower',
    'strtoupper'    => 'mb_strtoupper',
    'substr_count'  => 'mb_substr_count',
    'substr'        => 'mb_substr',
    'str_ireplace'  => null,
    'str_split'     => 'mb_str_split', //TODO - check this works
    'strcasecmp'    => 'mb_strcasecmp', //TODO - check this works
    'strcspn'       => null, //TODO - implement alternative
    'strrev'        => 'mb_strrev', //TODO - check this works
    'strspn'        => null, //TODO - implement alternative
    'substr_replace'=> 'mb_substr_replace',
    'lcfirst'       => null,
    'ucfirst'       => 'mb_ucfirst',
    'ucwords'       => 'mb_ucwords',
    'wordwrap'      => null,
);
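
One way to put a list like this to work during a migration is to scan your source for the byte-oriented names. The following is only a sketch: findUnsafeCalls is a hypothetical helper (it is not part of mb_extra), it relies on the tokenizer extension, and it matches on name only, so it will also flag methods or constants that happen to share a name.

```php
<?php
/**
 * Report uses of byte-oriented string functions in a piece of PHP source.
 * Hypothetical helper for illustration; matches identifiers by name only.
 */
function findUnsafeCalls(string $source, array $unsafeFunctions): array
{
    $hits = [];
    foreach (token_get_all($source) as $token) {
        // Plain identifiers arrive as [T_STRING, text, line] arrays.
        if (!is_array($token) || $token[0] !== T_STRING) {
            continue;
        }
        $name = strtolower($token[1]);
        if (array_key_exists($name, $unsafeFunctions)) {
            $hits[] = [
                'function'    => $name,
                'line'        => $token[2],
                'replacement' => $unsafeFunctions[$name], // null = no drop-in replacement
            ];
        }
    }
    return $hits;
}

// Shortened copy of the list, for the example only.
$sampleList = ['strlen' => 'mb_strlen', 'substr' => 'mb_substr', 'wordwrap' => null];
$sampleCode = "<?php\n\$short = substr(\$title, 0, 10);\n\$text = wordwrap(\$body);\n";
$hits = findUnsafeCalls($sampleCode, $sampleList);
```

Each hit carries the function name, the line it was found on, and the suggested replacement, so entries with a null replacement can be reviewed by hand.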

MySQL

Although you would have thought that setting the character set to utf8 would give you full UTF-8 support in MySQL, it does not.

It only supports UTF-8 characters that are encoded in up to 3 bytes, i.e. the Basic Multilingual Plane. However, people actively use characters that require 4 bytes to encode, including most emoji, which live in the Supplementary Multilingual Plane.
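
The byte counts are easy to check from PHP. This is a small sketch; needsFourByteUtf8 is a hypothetical helper name, and the idea is simply to detect any code point above U+FFFF, which is what the 3-byte utf8 character set cannot hold.

```php
<?php
/**
 * True if $text contains a code point above U+FFFF, i.e. one that UTF-8
 * encodes in four bytes. Hypothetical helper name, for illustration.
 */
function needsFourByteUtf8(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$bmpBytes   = strlen("汉");  // U+6C49, inside the Basic Multilingual Plane
$emojiBytes = strlen("😀");  // U+1F600, outside it
```

A check like this could be run over existing data to see whether anything you already accept would need the wider character set.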

To support these you should in general use:

utf8mb4 - for your character encoding.
utf8mb4_unicode_ci - for your character collation.

For specific scenarios other collations may be more appropriate, but in general stick with the collation that compares and sorts most correctly.

The places where you should set the character set and collation in your MySQL config file are:

[mysql]
default-character-set=utf8mb4

[client]
default-character-set=utf8mb4

[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

The SET NAMES may not be required in all circumstances, but it is safer to leave it on, at only a small speed penalty.
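
The connection character set can also be requested from the PHP side, which helps when you do not control the server config. The sketch below assumes PDO with the pdo_mysql driver; buildUtf8mb4Dsn is a hypothetical helper, and the host, database name and credentials are placeholders.

```php
<?php
/**
 * Build a PDO MySQL DSN that asks for utf8mb4 on the connection.
 * Hypothetical helper; host and database names are placeholders.
 */
function buildUtf8mb4Dsn(string $host, string $database): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $database);
}

$dsn = buildUtf8mb4Dsn('localhost', 'app');

// With a reachable server and real credentials this would be used as:
// $pdo = new PDO($dsn, 'app_user', 'secret');
```

Putting the charset in the DSN means the driver also knows the connection encoding, which a SET NAMES query sent by hand does not tell it.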

PHP INI File

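Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also to set all the recommended parameters (note that since PHP 5.6 the mbstring.internal_encoding, mbstring.http_input and mbstring.http_output settings are deprecated in favour of default_charset):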

mbstring.language   = Neutral   ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding  = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On  ;  HTTP input encoding translation is enabled
mbstring.http_input     = auto  ; Set HTTP input character set detection to auto
mbstring.http_output    = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order   = auto  ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset      = UTF-8 ; Default character set for auto content type header
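
If you want to confirm at runtime that the settings took effect, or you are on a host where php.ini cannot be edited, something along these lines could go in the bootstrap. It is a sketch only: utf8ConfigProblems is a hypothetical helper and checks just two of the settings.

```php
<?php
/**
 * List the UTF-8 related settings that differ from the recommended values.
 * Hypothetical helper; an empty array means nothing was found.
 */
function utf8ConfigProblems(): array
{
    $problems = [];
    if (strcasecmp((string) ini_get('default_charset'), 'UTF-8') !== 0) {
        $problems[] = 'default_charset is not UTF-8';
    }
    if (!extension_loaded('mbstring')) {
        $problems[] = 'mbstring extension is not loaded';
    } elseif (strcasecmp(mb_internal_encoding(), 'UTF-8') !== 0) {
        $problems[] = 'mb_internal_encoding() is not UTF-8';
    }
    return $problems;
}

// Runtime fallback for hosts where php.ini cannot be edited.
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');
$problems = utf8ConfigProblems();
```

Logging the returned list at startup is one way to notice when a server has been deployed with a different configuration.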

Helping browsers choose UTF-8 for forms

You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF-8.
Add a UTF-8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF-8.
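
Both points can be sketched together as follows. The function names, the field name utf8 and the choice of a check mark as the probe character are all assumptions made for the example. Comparing the probe on arrival goes a step beyond the advice above, but it is a cheap way to spot a form that was submitted in another encoding.

```php
<?php
// Any non-ASCII character works as the probe; the check mark is an arbitrary choice.
const UTF8_PROBE = "✓";

/** Render a form that requests UTF-8 and carries the hidden probe field. */
function renderUtf8Form(string $action): string
{
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');
    return '<form method="post" action="' . $action . '" accept-charset="UTF-8">' . "\n"
         . '  <input type="hidden" name="utf8" value="' . UTF8_PROBE . '">' . "\n"
         . '  <input type="text" name="comment">' . "\n"
         . '  <input type="submit" value="Send">' . "\n"
         . '</form>';
}

/** Assumption beyond the original advice: compare the probe on arrival. */
function submittedAsUtf8(array $post): bool
{
    return isset($post['utf8']) && $post['utf8'] === UTF8_PROBE;
}
```

In a controller you would call submittedAsUtf8($_POST) before trusting the other fields, and reject or re-encode the request if it returns false.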

Misc

If you're using Apache set "AddDefaultCharset utf-8"
As you said you're already doing, but just to remind anyone reading the answer: set the meta content-type in the HTML head as well.

That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.