I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO-8859-1, or perhaps WINDOWS-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this? For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?
function make_safe_for_utf8_use($string) {
    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");
    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    } else {
        return $string;
    }
}
Have a look at http://www.phpwact.org/php/i18n/charsets for a guide about charsets. This page links to a page specifically for utf8.
Not sure if this would achieve the same thing, but couldn't you just use
on all text without worrying about detection? If the text is already UTF-8, it won't hurt it. And if it's not, it will be converted. If you've already thought about doing this, is there a reason this wouldn't work for you?
UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.
Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
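To make the "default to Windows-1252" suggestion concrete, here is a minimal sketch. The function name to_utf8_assuming_cp1252() is hypothetical, and it assumes the mbstring extension is available. Anything that is not already well-formed UTF-8 is treated as Windows-1252, which is a guess rather than a detection.

```php
<?php
// Sketch only: to_utf8_assuming_cp1252() is a hypothetical helper name.
// Assumption: input is either valid UTF-8 or a single-byte Western encoding.
function to_utf8_assuming_cp1252(string $bytes): string
{
    // Already well-formed UTF-8: pass through untouched.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise decode every byte as Windows-1252, so 0x80-0x9F become
    // smart quotes, the Euro sign, etc. instead of C1 control characters.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

$curly = to_utf8_assuming_cp1252("\x93hi\x94");   // Windows-1252 smart quotes
$same  = to_utf8_assuming_cp1252("caf\xC3\xA9");  // already UTF-8
```

Note that a short Windows-1252 string can, by coincidence, also be valid UTF-8, in which case this sketch leaves it unchanged.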
You would certainly want to set the optional ‘strict’ parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.
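For reference, this is roughly what passing the strict flag looks like. It is only a sketch: the candidate list is an assumption, and the detection heuristics differ between PHP versions, so the result should be treated as a guess.

```php
<?php
// "\xE9t\xE9" is "été" in ISO-8859-1 / Windows-1252; it is NOT valid UTF-8.
$latin = "\xE9t\xE9";

// Third argument true = strict mode: encodings in which the bytes are
// invalid are rejected instead of being accepted as a "close enough" match.
$guess = mb_detect_encoding($latin, ['UTF-8', 'Windows-1252'], true);

// $guess is a string naming the detected encoding, or false if none fits.
var_dump($guess);
```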
If you want to be sure, do it yourself using the W3-recommended regex:
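The code that originally followed is not reproduced here, so what follows is a sketch of the well-known W3C pattern wrapped in a PHP function. The name is_valid_utf8_w3c() is hypothetical. The pattern is matched byte by byte (no 'u' modifier), and it also rejects ASCII control characters other than tab, LF and CR, so it is slightly stricter than pure UTF-8 well-formedness.

```php
<?php
// Sketch only: is_valid_utf8_w3c() is a hypothetical helper name.
function is_valid_utf8_w3c(string $bytes): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII: tab, LF, CR, printable
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x';

    // preg_match() returns 1 on a match, 0 on no match, false on error
    // (for example when PCRE limits are hit on very long input).
    return preg_match($pattern, $bytes) === 1;
}
```

Because the whole string is consumed by one repeated group, very long inputs can run into PCRE backtracking or recursion limits; in that case this sketch reports the string as invalid.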
With mbstring library, you have mb_check_encoding().
Example of use:
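The original snippet is missing at this point, so here is a minimal sketch of how mb_check_encoding() is typically called. The sample strings are made up for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
$valid   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$invalid = "caf\xE9";     // "café" encoded as ISO-8859-1, not valid UTF-8

// Second argument names the encoding to validate against.
$validIsUtf8   = mb_check_encoding($valid, 'UTF-8');   // true
$invalidIsUtf8 = mb_check_encoding($invalid, 'UTF-8'); // false
```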
When performance matters, this is faster than the regex provided in the accepted answer. A quick test on my configuration shows (for 20,000 iterations):
With PHP 7.1.9 on a recent Windows 10 system, the regex solution outperforms mb_check_encoding() for any string length (still 20,000 iterations):
mb_check_encoding() => 64ms
mb_check_encoding() => 2.4s
answer to "iconv is idempotent"
neither is iconv - iconv is not idempotent
a big difference between utf8_encode() & iconv() is that iconv may raise errors like this "Detected an incomplete multibyte character in input string" even with
in the above code:
you have to know that mb_detect_encoding can answer UTF-8 even for invalid (badly formed) UTF-8 strings
Just a note: Instead of using the often recommended (rather complex) regular expression by W3C, you can simply use the 'u' modifier to test a string for UTF-8 validity:
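The snippet that belonged here is missing, so this is a sketch of the usual idiom. The helper name is_valid_utf8_pcre() is hypothetical. It relies on PCRE validating the subject string as UTF-8 whenever the 'u' modifier is set.

```php
<?php
// Sketch only: is_valid_utf8_pcre() is a hypothetical helper name.
function is_valid_utf8_pcre(string $bytes): bool
{
    // The empty pattern matches any string, so the only way this call can
    // fail is the UTF-8 check triggered by the 'u' modifier. On malformed
    // input preg_match() returns false rather than 0.
    return preg_match('//u', $bytes) === 1;
}

$ok  = is_valid_utf8_pcre("caf\xC3\xA9"); // valid UTF-8
$bad = is_valid_utf8_pcre("caf\xE9");     // lone 0xE9 byte is malformed
```

Unlike the W3C pattern, this check accepts ASCII control characters, since they are valid UTF-8.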