These past few days I've been working toward converting my PHP code base from latin1 to UTF-8. I've read that the two main solutions are to either replace the single-byte functions with the built-in multibyte functions, or set the mbstring.func_overload value in the php.ini file.
But then I came across this thread on Stack Overflow, where the post by thomasrutter seems to indicate that the multibyte functions aren't actually necessary for UTF-8, as long as the script and string literals are encoded in UTF-8.
I haven't found any other evidence whether this is true or not, and if it turns out I don't need to convert my code to the mb_functions then that would be a real time saver! Anyone able to shed some light on this?
As far as I understand the issue, as long as all your data is 100% in UTF-8 (and that means user input, database, and also the encoding of the PHP files themselves if you have special characters in them), this is true for search and comparison operations. However, as @ntd points out, a non-multibyte strlen() will produce wrong results when run on a string that contains multibyte characters.
This is a great article on the basics of encoding.
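To illustrate the search and comparison point, here is a minimal sketch. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are made up for the example. Byte-wise search finds the right match because a valid UTF-8 sequence for one character never appears inside the sequence for another, but the offset that comes back is counted in bytes, not characters.

```php
<?php
// Assumption: this file is saved as UTF-8, so "é" is the two bytes C3 A9.
$haystack = "é-x";

// Plain search works, but the offset is in bytes.
$byteOffset = strpos($haystack, "x");                 // 3
// The multibyte variant reports the offset in characters.
$charOffset = mb_strpos($haystack, "x", 0, "UTF-8");  // 2

// Finding a multibyte needle and comparing strings are both byte-wise,
// which is fine as long as both sides are UTF-8.
$found = strpos($haystack, "é") !== false;            // true
$equal = ("é" === "\xC3\xA9");                        // true

echo "strpos: $byteOffset\n";
echo "mb_strpos: $charOffset\n";
```

So the plain functions are fine for "is it in there?" and "are these equal?", but as soon as you feed the offset into something that counts characters, the two results diverge.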
They aren't "necessary" unless you're using any of the functions they replace (and it's likely that you are using at least one of these) or otherwise explicitly need a feature of the extension such as HTTP handling.
When working towards UTF-8 compliance, I always fall back to the PHP UTF-8 Cheatsheet with one addition: PCRE patterns need to be updated to use the u modifier.
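A small sketch of what the u modifier changes, assuming the file is saved as UTF-8 (the variable names are just for illustration). Without it, PCRE treats the subject as bytes, so "." matches a single byte rather than a single character.

```php
<?php
// Assumption: this file is saved as UTF-8, so "é" is two bytes.
$s = "é";

// Without /u the dot matches one byte, so a two-byte character fails ^.$
$withoutU = preg_match('/^.$/', $s);   // 0
// With /u the pattern and subject are treated as UTF-8.
$withU = preg_match('/^.$/u', $s);     // 1

// A common idiom: split a UTF-8 string into characters.
$chars = preg_split('//u', "žđš", -1, PREG_SPLIT_NO_EMPTY);

echo "without u: $withoutU, with u: $withU\n";
echo "characters: " . count($chars) . "\n";
```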
As soon as you're examining or modifying a multibyte string, you need to use a mb_* function. A very quick example which demonstrates why:
$str = "abcžđščćöçefg";
mb_internal_encoding("UTF-8");
echo "strlen: ".strlen($str)."\n";
echo "mb_strlen: ".mb_strlen($str)."\n";
This prints out:
strlen: 20
mb_strlen: 13
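The same thing applies when modifying a string, not just measuring it. A quick sketch, again assuming a UTF-8 source file and the mbstring extension (variable names are only for the example): substr() cuts at byte positions and can leave half a character behind.

```php
<?php
// Assumption: this file is saved as UTF-8; each letter here is two bytes.
$str = "žđšč";

// substr() takes one byte, which is only the first half of "ž".
$firstByte = substr($str, 0, 1);
// mb_substr() takes one whole character.
$firstChar = mb_substr($str, 0, 1, "UTF-8");

// The byte-cut result is no longer valid UTF-8.
$byteCutIsValid = mb_check_encoding($firstByte, "UTF-8");
$charCutIsValid = mb_check_encoding($firstChar, "UTF-8");

var_dump($byteCutIsValid); // bool(false)
var_dump($charCutIsValid); // bool(true)
```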
thomasrutter only indicates that search does not need special handling. Other operations do: for example, if you need to check the length of a UTF-8 string, I don't see how you can do that using plain strlen().
Functions such as mb_strtoupper may be necessary, too. strtoupper won't convert á to Á.
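A short sketch of the case conversion difference, assuming a UTF-8 source file and the mbstring extension (variable names are illustrative only). The exact bytes strtoupper() returns for non-ASCII input can depend on the PHP version and locale, so the example only checks whether it produced the expected uppercase letter.

```php
<?php
// Assumption: this file is saved as UTF-8.
$lower = "á";

// strtoupper() works on single bytes and does not know about UTF-8.
$plain = strtoupper($lower);
// mb_strtoupper() converts the character when told the encoding.
$multibyte = mb_strtoupper($lower, "UTF-8");

var_dump($plain === "Á");     // bool(false)
var_dump($multibyte === "Á"); // bool(true)
```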
There are a number of functions that expect strings to be single-byte (and some even presume that the encoding is ISO-8859-1). In these cases, you need to be aware of what you're doing and possibly use replacement functions. There is a fairly comprehensive list at: http://www.phpwact.org/php/i18n/utf-8
You could use the mbfunctions library that extends the multibyte functions in PHP:
http://code.google.com/p/mbfunctions/
You can use the mbstring.func_overload setting (http://php.net/manual/en/mbstring.overload.php) in your php.ini file, so you don't need to change your code.
But be careful, because not all string functions will be automatically changed.
This is one: http://php.net/manual/en/function.substr-replace.php
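For a function like that, you would have to write the replacement yourself. Below is a rough sketch built on mb_strlen() and mb_substr(); the name mb_substr_replace_sketch and its parameters are made up for this example and are not part of PHP or mbstring. It only handles plain string arguments (not the array forms substr_replace() accepts) and assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: a character-based stand-in for substr_replace().
// Not part of PHP; string arguments only.
function mb_substr_replace_sketch(
    string $string,
    string $replacement,
    int $start,
    ?int $length = null,
    string $encoding = "UTF-8"
): string {
    $total = mb_strlen($string, $encoding);

    // Negative start counts from the end, as in substr_replace().
    if ($start < 0) {
        $start = max(0, $total + $start);
    }
    $start = min($start, $total);

    if ($length === null) {
        // No length given: replace through the end of the string.
        $length = $total - $start;
    } elseif ($length < 0) {
        // Negative length: stop that many characters before the end.
        $length = max(0, $total - $start + $length);
    }

    return mb_substr($string, 0, $start, $encoding)
        . $replacement
        . mb_substr($string, $start + $length, null, $encoding);
}

// Replace two characters starting at character offset 1.
echo mb_substr_replace_sketch("žđšč", "X", 1, 2), "\n";
```

With a UTF-8 source file, the call at the end should replace "đš" as whole characters, where substr_replace() with the same offsets would cut through the middle of them.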