PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe, would it work properly if it was only given valid UTF-8 strings as arguments?
Edit: I'm not looking for a replacement function, I would just like to know if this hypothesis is correct.
Well, I do have a counter-example: I have a UTF-8 encoded ".ini" settings file specifying application settings like the email sender name. It says something like:
and I read it from there into the variable $sender. Now I replace the placeholder in the message body (UTF-8 again): regards {sender}
The email is absolutely correct in every respect, but the sender is totally broken. There are other cases (like explode()) where something goes wrong with a UTF-8 string: it is healthy before the conversion but not after it. Sorry to say, there seems to be no way of correcting this behaviour.
Edit: Actually, explode() is involved in parsing the .ini file, so the problem may well lie in that very function, and str_replace() may well be innocent.
Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.
In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF. A byte from this range may not appear anywhere else in the sequence (the remaining bytes are continuation bytes in the range \x80-\xBF), so you can't make a valid UTF-8 sequence that matches part of a character.
This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example when trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
It's correct because UTF-8 multibyte characters are exclusively non-ASCII (128+ byte value) characters beginning with a byte that defines how many bytes follow, so you can't accidentally end up matching a part of one UTF-8 multibyte character with another.
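The contrast with Shift-JIS described above can be sketched in PHP. This is only an illustration, not code from the thread: the variable names are made up, and the strings are written as raw byte escapes so the result does not depend on the encoding of the source file. It relies on 表 being encoded as E8 A1 A8 in UTF-8 and as 95 5C in Shift-JIS.

```php
<?php
// UTF-8: every byte of a multibyte character is >= 0x80, so an ASCII
// needle such as a backslash (0x5C) can never match inside one.
$utf8 = "caf\xC3\xA9 \\ \xE8\xA1\xA8";        // "café \ 表" as UTF-8 bytes
$utf8Result = str_replace("\\", "/", $utf8);  // only the real backslash changes

// A multibyte needle works too: C3 A9 ("é") can only match a whole "é".
$replaced = str_replace("\xC3\xA9", "e", $utf8);

// Shift-JIS: the second byte of 表 (95 5C) is identical to an ASCII
// backslash, so the same byte-wise replacement corrupts the character.
$sjis = "\x95\x5C";                           // "表" as Shift-JIS bytes
$sjisResult = str_replace("\\", "/", $sjis);

echo bin2hex($utf8Result), "\n";
echo bin2hex($sjisResult), "\n";              // 952f: no longer 表
```

The UTF-8 string survives both replacements intact, while the Shift-JIS string loses its trail byte.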
To visualise (abstractly):
a for an ASCII character
2x for a 2-byte character
3xx for a 3-byte character
4xxx for a 4-byte character
If you're matching, say, a2x3xx (a being a byte in the ASCII range), since a < x, and 2x cannot be a subset of 3xx or 4xxx, et cetera, you can be safe that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
Edit: See bobince's answer for a less abstract explanation.
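The abstract notation above can be made concrete with a small sketch. The helpers byteClass() and bytePattern() are hypothetical names introduced here for illustration; they label each byte with the class used in the visualisation and assume the input is already valid UTF-8 (they do not validate it).

```php
<?php
// Hypothetical helper: map a single byte value to its class in the
// visualisation. Assumes valid UTF-8 input; invalid lead bytes such as
// 0xC0 or 0xF5 are not rejected here.
function byteClass(int $byte): string
{
    if ($byte < 0x80) return 'a'; // ASCII byte
    if ($byte < 0xC0) return 'x'; // continuation byte (0x80-0xBF)
    if ($byte < 0xE0) return '2'; // lead byte of a 2-byte character
    if ($byte < 0xF0) return '3'; // lead byte of a 3-byte character
    return '4';                   // lead byte of a 4-byte character
}

// Hypothetical helper: build the class pattern for a whole string.
function bytePattern(string $s): string
{
    $pattern = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $pattern .= byteClass(ord($s[$i]));
    }
    return $pattern;
}

$sample  = "a\xC3\xA9\xE2\x82\xAC";   // "a", "é" (2 bytes), "€" (3 bytes)
$pattern = bytePattern($sample);       // "a2x3xx"
$emoji   = bytePattern("\xF0\x9F\x98\x80"); // U+1F600, 4 bytes: "4xxx"

// A 2-byte needle cannot be found inside a 3-byte character.
$found = strpos("\xE2\x82\xAC", "\xC3\xA9");

echo $pattern, "\n", $emoji, "\n";
```

Because a lead byte is never in the same class as a continuation byte, a needle that starts on a lead byte can only line up with a character boundary in the haystack.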
Yes, I think this is correct, at least I couldn't find any counter-example.
No, you cannot.
Speaking from practice: if you have some multibyte symbols like ◊ and others that are not multibyte, it won't work correctly, because some symbols take 2-4 bytes to store, while str_replace takes fixed bytes and replaces them. As a result we end up with something that isn't any symbol at all, just garbage.
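One possible explanation for garbage like this, which would fall outside the question's premise, is that not every string involved is valid UTF-8. A minimal sketch for checking that, assuming the PCRE extension is available (it is bundled with PHP by default); isValidUtf8() and the sample values are hypothetical and not taken from the thread:

```php
<?php
// Hypothetical helper: preg_match() with the /u modifier fails on a
// subject that is not valid UTF-8, so a result of 1 means "valid".
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$utf8Name   = "Andr\xC3\xA9"; // "André" encoded as UTF-8
$latin1Name = "Andr\xE9";     // "André" encoded as ISO-8859-1

$body = "regards {sender}";   // plain ASCII, hence also valid UTF-8

// str_replace() copies bytes verbatim, so the output is only valid
// UTF-8 if every argument was.
$good = str_replace('{sender}', $utf8Name, $body);
$bad  = str_replace('{sender}', $latin1Name, $body);

var_dump(isValidUtf8($good)); // bool(true)
var_dump(isValidUtf8($bad));  // bool(false)
```

If a check like this fails for any input, the corruption happened before str_replace() was called, for example when the value was read or parsed.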