Note: What I think I know is probably wrong, so please kindly fix my knowledge :)
I just answered a question about UTF-8 and PHP.
I suggested using str_ireplace('Волгоград', '', $a).
I didn't expect this to work, but it did.
I always thought PHP treated one byte as one character, which is why you need to use the mb_* functions to get accurate results with characters outside the ASCII range.
I assumed the Russian characters would take > 1 byte each.
I thought str_replace() would work because the bytes could be matched regardless of whether they are multibyte or not, as long as they are in order.
I thought str_ireplace() would not work because PHP wouldn't know how to map the non-ASCII characters to their alternate-case equivalents. But it did work.
Where and how am I wrong? Give me as much information as you can :)
It's the other way round: PHP does not treat every character as a byte, it treats every byte as a character. So a multibyte character is seen as multiple characters (and probably not the ones you expect).
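To make the byte-versus-character point concrete, here is a minimal sketch. It assumes the script file is saved as UTF-8; the variable names are only illustrative.

```php
<?php
// Assumption: this source file is saved as UTF-8.
$city = 'Волгоград';

// strlen() counts bytes, not letters: each of the 9 Cyrillic letters
// occupies 2 bytes in UTF-8.
$byteCount = strlen($city);

// The first letter "В" (U+0412) is the two bytes D0 92.
$firstLetterHex = bin2hex(substr($city, 0, 2));

// Taking a single "character" with substr() cuts the letter in half
// and leaves only its lead byte.
$halfLetterHex = bin2hex(substr($city, 0, 1));

echo $byteCount, "\n";       // 18
echo $firstLetterHex, "\n";  // d092
echo $halfLetterHex, "\n";   // d0
```

This is why the byte-oriented string functions give surprising lengths and offsets for such text, while plain byte-for-byte matching still works.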
Another possible explanation: parts of the Unicode range are laid out much like the ISO-8859-1 range.
Converting an uppercase letter into lowercase just requires adding 0x20 in the ASCII range. And (I did not bother to look it up) I think it's the same for the Latin-1 range in 0xC0-0xDE, with 0xD7, the multiplication sign, being an exception. This coincidentally might work for Russian letters encoded as UTF-8 too: for some of them, the lowercase form differs from the uppercase one only by 0x20 in the second byte.
The difference is just that 0x20 would have been added to bytes which were assumed to be Latin-1 characters. So it's probably really just a locale setting.
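The 0x20 arithmetic can be checked directly on the bytes. The sketch below writes the UTF-8 sequences as hex escapes so that it does not depend on the file encoding; the variable names are illustrative. It also shows that the offset holds only for part of the Cyrillic alphabet, since for "Р" the lead byte changes as well.

```php
<?php
// ASCII: 'A' is 0x41, and adding 0x20 gives 0x61, which is 'a'.
$asciiLower = chr(ord('A') + 0x20);

// ISO-8859-1: "É" is the single byte 0xC9 and "é" is 0xE9.
$latin1Offset = 0xE9 - 0xC9;

// UTF-8: "В" (U+0412) is D0 92 and "в" (U+0432) is D0 B2,
// so only the second byte differs, and it differs by 0x20.
$upperVe = "\xD0\x92";
$lowerVe = "\xD0\xB2";
$veOffset = ord($lowerVe[1]) - ord($upperVe[1]);

// UTF-8: "Р" (U+0420) is D0 A0 but "р" (U+0440) is D1 80,
// so here the lead byte changes and a plain +0x20 does not apply.
$upperEr = "\xD0\xA0";
$lowerEr = "\xD1\x80";
$erSameLeadByte = ($upperEr[0] === $lowerEr[0]);

echo $asciiLower, "\n";              // a
printf("0x%02X\n", $latin1Offset);   // 0x20
printf("0x%02X\n", $veOffset);       // 0x20
var_dump($erSameLeadByte);           // bool(false)
```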
It works by lowercasing the text: both strings are passed to the libc case functions, which depend on the locale settings. With appropriate settings, the text is lowercased properly as long as the bytes really are in the charset that the locale expects.
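A short sketch of why the call in the question succeeds. It assumes the file is saved as UTF-8, and $a here is a made-up sample value, not the asker's data. When needle and haystack use the same case, their bytes are identical, so whatever byte-wise case folding is applied affects both in the same way and the match still holds. A differently cased needle is another matter: as far as I know, since PHP 8.2 the case folding in str_ireplace() is ASCII-only and no longer follows the locale, so that call is shown without an expected result. The preg_replace() line is one Unicode-aware alternative, not necessarily what the original answer used.

```php
<?php
// Assumption: this source file is saved as UTF-8.
$a = 'Город Волгоград'; // hypothetical sample input

// Same case on both sides: identical byte sequences, so the match
// succeeds no matter how (or whether) the bytes are case-folded.
$sameCase = str_ireplace('Волгоград', '', $a);

// Different case: this only matches if PHP can fold Cyrillic case,
// which depends on the PHP version and locale, so no result is claimed.
$otherCase = str_ireplace('ВОЛГОГРАД', '', $a);

// Unicode-aware alternative: PCRE with /u (treat as UTF-8) and /i (caseless).
$unicodeAware = preg_replace('/ВОЛГОГРАД/iu', '', $a);

var_dump($sameCase === 'Город ');      // bool(true)
var_dump($otherCase === $a);           // true means the needle was not found
var_dump($unicodeAware === 'Город ');  // bool(true)
```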