Note: What I think I know is probably wrong, so please kindly fix my knowledge :)
I just answered a question about UTF-8 and PHP.
I suggested using str_ireplace('Волгоград', '', $a).
I didn't expect this to work, but it did.
I always thought PHP treated one byte as one character, which is why you need the mb_* functions to get accurate results with characters outside the ASCII range.
I assumed the Russian characters would take > 1 byte each.
I thought str_replace() would work because the bytes could be matched regardless of whether they are multibyte or not, as long as they are in order.
I thought str_ireplace() would not work because PHP wouldn't know how to map the non-ASCII characters to their alternate-case equivalents. But it did work.
Where and how am I wrong? Give me as much information as you can :)
It works by lowercasing the text through the libc functions, which depend on the locale settings. With appropriate settings, the text is lowercased properly, provided the bytes really are in the charset that the locale expects.
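Since the result of str_ireplace() on non-ASCII text depends on the locale (and, as far as I know, on the PHP version, with newer releases folding only ASCII letters), here is a minimal sketch of a replacement that does not rely on either. utf8_ireplace() is a hypothetical helper name, not a PHP built-in, and the sketch assumes the text is valid UTF-8 and that PCRE was built with Unicode support, which is the usual case.

```php
<?php
// Hypothetical helper (not part of PHP): case-insensitive replace for UTF-8 text.
// The "u" modifier makes PCRE treat pattern and subject as UTF-8, so "i"
// folds case by Unicode rules instead of by the current locale.
function utf8_ireplace(string $search, string $replace, string $subject): string
{
    $pattern = '/' . preg_quote($search, '/') . '/iu';

    // A callback keeps $replace literal (no "$1"-style backreference expansion).
    $result = preg_replace_callback($pattern, function () use ($replace) {
        return $replace;
    }, $subject);

    // preg_replace_callback() returns null on error, e.g. invalid UTF-8 input.
    return $result ?? $subject;
}

// Assumes this file is saved as UTF-8.
$a = 'Город ВОЛГОГРАД на Волге';
echo utf8_ireplace('Волгоград', '', $a), "\n"; // the all-caps spelling is removed as well
```

The callback is only there so that a replacement string containing "$" or "\" is inserted verbatim; for an empty replacement, as in the question, plain preg_replace() would behave the same.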
Another possible explanation: some Unicode blocks are laid out much like the ISO-8859-1 range.
Converting an uppercase letter into lowercase just requires adding 0x20 for the ASCII range:
0x41 A
0x61 a
And (I did not bother to look it up) I think it's the same for the Latin-1 uppercase letters in 0xC0-0xDE, apart from 0xD7, which is the multiplication sign. And this coincidentally might work for the Russian letters in the Unicode range too:
d092d09ed09bd093d09ed093d0a0d090d094 ВОЛГОГРАД
d0b2d0bed0bbd0b3d0bed0b3d180d0b0d0b4 волгоград
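For most letters the difference is just that 0x20 has been added to the second byte of each pair, the byte that would be taken for an L1 character. Р is the exception: d0a0 becomes d180, because its lowercase form falls under the next UTF-8 lead byte. So it's probably really just a locale setting.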
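The byte pattern in the two hex dumps can be checked mechanically. This is a rough sketch, assuming the file is saved as UTF-8 so that every letter here is exactly two bytes; the variable names are just for illustration.

```php
<?php
// Assumes this file is saved as UTF-8: each Cyrillic letter below is two bytes.
$upper = 'ВОЛГОГРАД';
$lower = 'волгоград';

// Collect the uppercase letters that do NOT follow the
// "same lead byte, trail byte + 0x20" pattern.
$exceptions = [];
for ($i = 0; $i < strlen($upper); $i += 2) {
    $sameLead  = $upper[$i] === $lower[$i];
    $trailDiff = ord($lower[$i + 1]) - ord($upper[$i + 1]);

    printf(
        "%s %s -> %s %s\n",
        substr($upper, $i, 2),
        bin2hex(substr($upper, $i, 2)),
        substr($lower, $i, 2),
        bin2hex(substr($lower, $i, 2))
    );

    if (!$sameLead || $trailDiff !== 0x20) {
        $exceptions[] = substr($upper, $i, 2);
    }
}

print_r($exceptions); // only Р (d0a0 -> d180) breaks the pattern
```

So a byte-wise "+0x20" mapping would get eight of these nine letters right by coincidence, which is not something to rely on.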
It's the other way round: PHP does not treat every character as a byte, but it treats every byte as a character. So a multibyte character is seen as multiple characters (and probably not the ones you expect).
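A small illustration of the byte-versus-character point, assuming the file is saved as UTF-8 and the mbstring extension is loaded:

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is available.
$s = 'Волгоград';

echo strlen($s), "\n";             // 18: strlen() counts bytes
echo mb_strlen($s, 'UTF-8'), "\n"; // 9: mb_strlen() counts characters
echo bin2hex($s[0]), "\n";         // d0: $s[0] is a single byte, only half of "В"
```

The plain string functions see 18 one-byte "characters" here, which is why anything that has to reason about individual letters needs the mb_* functions.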