Why did this str_ireplace() work on a non ASCII st

Posted 2019-06-17 02:32

Question:

Note: What I think I know is probably wrong, so please kindly fix my knowledge :)


I just answered a question about UTF-8 and PHP.

I suggested using str_ireplace('Волгоград', '', $a).

I didn't expect this to work, but it did.

I always thought PHP treated one byte as one character, which is why you need the mb_* functions to get accurate results with characters outside the ASCII range.

I assumed the Russian characters would take more than one byte each.

I thought str_replace() would work because the bytes could be matched regardless of whether they are multibyte or not, as long as they are in order.
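The byte-matching idea can be checked with a minimal sketch. The sample value of $a is an assumption (the original string is not shown), and the file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical sample input; the question does not show the real $a.
$a = 'Город Волгоград';

// str_replace() compares raw bytes, so an exact-case UTF-8 needle
// matches as long as its bytes appear in the same order.
$sameCase = str_replace('Волгоград', '', $a);
var_dump($sameCase === 'Город ');

// An all-uppercase needle is a different byte sequence, so the
// case-sensitive function leaves the string untouched.
$otherCase = str_replace('ВОЛГОГРАД', '', $a);
var_dump($otherCase === $a);
```

Nothing here depends on the locale, because no case conversion takes place.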

I thought str_ireplace() would not work because PHP wouldn't know how to map the non-ASCII characters to their other-case equivalents. But it did work.


Where and how am I wrong? Give me as much information as you can :)

Answer 1:

It works by lowercasing the text with libc functions, which depend on the locale settings. With appropriate settings, the text is lowercased properly, provided the bytes are in the charset that the locale expects.
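If relying on the locale is not desirable, one option is to do the case-insensitive comparison with the mbstring functions instead. The following is only a sketch: mb_str_ireplace_sketch is a hypothetical helper (PHP has no built-in by that name), it assumes the mbstring extension is loaded and the input is UTF-8, and it assumes case folding does not change the length of the text, which holds for the Cyrillic letters discussed here.

```php
<?php
// Hypothetical helper, not part of PHP: a case-insensitive replace that
// works on UTF-8 characters rather than on single bytes.
function mb_str_ireplace_sketch(string $search, string $replace, string $subject): string
{
    if ($search === '') {
        return $subject; // nothing to look for
    }
    $result = '';
    $offset = 0; // position in characters, not bytes
    $searchLen = mb_strlen($search, 'UTF-8');
    while (($pos = mb_stripos($subject, $search, $offset, 'UTF-8')) !== false) {
        // Copy the text before the match, then the replacement.
        $result .= mb_substr($subject, $offset, $pos - $offset, 'UTF-8') . $replace;
        $offset = $pos + $searchLen;
    }
    // Append whatever follows the last match.
    return $result . mb_substr($subject, $offset, null, 'UTF-8');
}

$cleaned = mb_str_ireplace_sketch('Волгоград', '', 'Город ВОЛГОГРАД, Россия');
var_dump($cleaned);
```

Passing the encoding explicitly keeps the result independent of both the locale and the mbstring default encoding.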



Answer 2:

Another possible explanation: some Unicode ranges have a case layout similar to that of the ISO-8859-1 range.

Converting an uppercase letter into lowercase just requires adding 0x20 for the ASCII range:

0x41   A
0x61   a

And (I did not bother to look it up) I think it's the same for the Latin-1 range 0xC0-0xDF. Coincidentally, this might also work for the Russian letters in their UTF-8 encoding:

d092d09ed09bd093d09ed093d0a0d090d094   ВОЛГОГРАД
d0b2d0bed0bbd0b3d0bed0b3d180d0b0d0b4   волгоград

The difference is mostly just that 0x20 has been added to the bytes that were assumed to be L1 characters (the Р/р pair, d0a0 versus d180, is the exception in this word). So it's probably really just a locale setting.
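The byte arithmetic can be checked directly. This sketch assumes the file is saved as UTF-8, so each Cyrillic letter in the two words occupies exactly two bytes. It adds 0x20 to the second byte of every uppercase letter and collects the letters for which that guess does not produce the lowercase form:

```php
<?php
$upper = 'ВОЛГОГРАД';
$lower = 'волгоград';

// Two-byte chunks correspond to one letter each for these words.
$upperChars = str_split($upper, 2);
$lowerChars = str_split($lower, 2);

$exceptions = [];
foreach ($upperChars as $i => $u) {
    $l = $lowerChars[$i];
    // Keep the lead byte and add 0x20 to the second byte only.
    $guess = $u[0] . chr(ord($u[1]) + 0x20);
    printf("%s %s -> %s %s\n", $u, bin2hex($u), $l, bin2hex($l));
    if ($guess !== $l) {
        $exceptions[] = $u; // the +0x20 shortcut failed for this letter
    }
}

// For this word only "Р" lands here: its lowercase form changes the
// lead byte as well (d0a0 becomes d180).
var_dump($exceptions);
```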



Answer 3:

It's the other way round: PHP does not treat every character as a byte; it treats every byte as a character. So a multibyte character is seen as multiple characters (and probably not the ones you expect).
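A short sketch of what "every byte is a character" means in practice, assuming a UTF-8 source file and the mbstring extension for the comparison:

```php
<?php
$word = 'Волгоград';

// The byte-oriented function counts bytes: 2 per Cyrillic letter here.
$byteCount = strlen($word);

// The multibyte-aware function counts characters.
$charCount = mb_strlen($word, 'UTF-8');

// Indexing the string yields a single byte, which is only the first
// half of the letter "В" and is not valid UTF-8 on its own.
$firstByte = $word[0];

printf("bytes: %d, characters: %d, first byte: 0x%s\n",
    $byteCount, $charCount, bin2hex($firstByte));
```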