PHP: case-insensitive preg_replace of a cyrillic s

Posted 2020-04-08 12:47

I have a PHP 5.3 script that displays the users of my web site, and I would like to replace a certain Russian city (stored as UTF-8 in a PostgreSQL 8.4.7 database on 64-bit CentOS 5.5 Linux) with its older name (it is an inside joke):

preg_replace('/Волгоград/iu', 'Сталинград', $city);

Unfortunately this only works for exact matches: Волгоград.

This does not work for other cases, like ВОЛГОГРАД or волгоград.

If I modify my source code to

preg_replace('/[Вв]олгоград/iu', 'Сталинград', $city);

then it will catch the 2nd case above.

Does anybody know what is going on and how to fix it (assuming I don't want to write [Xx] for every letter)?

Thank you! Alex

UPDATE:

# rpm -qa|grep php
php53-bcmath-5.3.3-1.el5
php53-gd-5.3.3-1.el5
php53-common-5.3.3-1.el5
php53-pdo-5.3.3-1.el5
php53-mbstring-5.3.3-1.el5
php53-xml-5.3.3-1.el5
php53-5.3.3-1.el5
php53-cli-5.3.3-1.el5
php53-pgsql-5.3.3-1.el5

# rpm -qa|grep pcre
pcre-6.6-2.el5_1.7

9 answers
Summer. ? 凉城
Reply #2 · 2020-04-08 13:24

Just guessing, but explicitly encoding the string to Unicode may help:

preg_replace('/Волгоград/iu', utf8_encode('Сталинград'), $city);
女痞
Reply #3 · 2020-04-08 13:34

Actually, with PHP 5.2.x on Windows the accepted answer did not work for me.

I had to go through converting to Windows-1251 to make it work.

Here is the example:

$new_content = preg_replace(iconv('UTF-8', 'Windows-1251', "/\bгъз\b/i"), iconv('UTF-8', 'Windows-1251', "YYYYYY"), iconv('UTF-8', 'Windows-1251', "ти си gyz gyz гъз ГЪЗ gyzgyz гЪз gyz"));
$new_content = iconv('Windows-1251', 'UTF-8', $new_content);

The example above will successfully (and case-insensitively) replace 'гъз' with YYYYYY and give you back the UTF-8 version.

Regards!

男人必须洒脱
Reply #4 · 2020-04-08 13:40

I copy-pasted your capital В. It is indeed Cyrillic U+0412 (UTF-8 bytes D0 92), not the normal Latin B. But since they look so much alike (В and B), I believe the Russian letter is collated onto the Latin B of U+0042.

So either PHP is preformatting it, or maybe PCRE is somewhat inexact there too. Check the output of print PCRE_VERSION; and have a look at the changelog.

Anyway, to evade the problem I would suggest you only use the lowercase letters. They are more likely to be distinct from the Latin alphabet.

preg_replace('/волгоград/iu', 'Сталинград', $city);

P.S.: Evil inside joke!

SAY GOODBYE
Reply #5 · 2020-04-08 13:40

Works like a charm on my box...

<?php
    $city = 'Волгоград';
    var_dump(preg_match('/волгоград/ui', $city));
    var_dump(preg_match('/ВОЛГОГРАД/ui', $city));
    var_dump(preg_replace('/волгоград/ui', 'Сталинград', $city));
    var_dump(preg_replace('/ВОЛГОГРАД/ui', 'Сталинград', $city));

Output:

int 1
int 1
string 'Сталинград' (length=20)
string 'Сталинград' (length=20)

Are you sure that the input data ($city) is in UTF-8?
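
One way to check this is to test the string for UTF-8 validity and look at its raw bytes. The following is only a sketch: describe_encoding() is a hypothetical helper name, it assumes the mbstring extension is loaded (the question's package list includes php53-mbstring), and it assumes the source file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper for illustration; not part of the original script.
// Reports whether the value is valid UTF-8 and shows its raw bytes as hex.
function describe_encoding($value)
{
    return array(
        'valid_utf8' => mb_check_encoding($value, 'UTF-8'),
        'hex'        => bin2hex($value),
    );
}

// A UTF-8 literal: Cyrillic letters take two bytes each (d0xx / d1xx).
$utf8 = describe_encoding('Волгоград');
var_dump($utf8['valid_utf8']);
echo $utf8['hex'], "\n";

// The same word as Windows-1251 bytes: one byte per letter, not valid UTF-8.
$cp1251 = describe_encoding("\xC2\xEE\xEB\xE3\xEE\xE3\xF0\xE0\xE4");
var_dump($cp1251['valid_utf8']);
echo $cp1251['hex'], "\n";
```

If the value coming from the database turns out not to be valid UTF-8, the /u modifier cannot work on it, and the connection's client encoding would be the first thing to look at.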

何必那么认真
Reply #6 · 2020-04-08 13:45

You can skip the regex, it worked for me in PHP 5.2.11 :)

$city = 'Unfortunately this only works for exact matches: Волгоград.

This does not work for other cases, like ВОЛГОГРАД or волгоград.';

echo str_ireplace('Волгоград', '[found]', $city);

Output

"Unfortunately this only works for exact matches: [found].

This does not work for other cases, like [found] or [found]."

This intrigued me, so I asked a question.
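
Since str_ireplace() works on bytes and is not guaranteed to handle multibyte case, a variant built on the mbstring functions may be safer. This is a minimal sketch, not a drop-in standard function: mb_ireplace_sketch() is a hypothetical name, it assumes mbstring is loaded, and it assumes that case folding does not change the character count of the match (true for Cyrillic, not for every script).

```php
<?php
// Hypothetical helper: case-insensitive, multibyte-aware replace without regex.
// Assumes $search and $subject are in the same encoding (UTF-8 by default).
function mb_ireplace_sketch($search, $replace, $subject, $encoding = 'UTF-8')
{
    $searchLen = mb_strlen($search, $encoding);
    if ($searchLen === 0) {
        return $subject; // nothing to search for
    }

    $subjectLen = mb_strlen($subject, $encoding);
    $result = '';
    $offset = 0; // position in characters, not bytes

    // Find each case-insensitive occurrence, copy the text before it,
    // then append the replacement and continue after the match.
    while (($pos = mb_stripos($subject, $search, $offset, $encoding)) !== false) {
        $result .= mb_substr($subject, $offset, $pos - $offset, $encoding) . $replace;
        $offset = $pos + $searchLen;
    }

    // Append whatever is left after the last match.
    return $result . mb_substr($subject, $offset, $subjectLen, $encoding);
}

echo mb_ireplace_sketch('Волгоград', 'Сталинград', 'ВОЛГОГРАД и волгоград'), "\n";
```

Because it does not go through PCRE at all, this approach does not depend on how the PCRE library was built.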

我欲成王,谁敢阻挡
Reply #7 · 2020-04-08 13:46

I cannot reproduce your issue with PHP 5.3.3 (PHP 5.3.3-1ubuntu9.3 with Suhosin-Patch (cli)):

$str1 = 'Волгоград';
$str2 = 'ВОЛГОГРАД';
$str3 = 'волгоград';

var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str1));
var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str2));
var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str3));

outputs

string(20) "Сталинград"
string(20) "Сталинград"
string(20) "Сталинград"

Which PCRE version is your PHP using? Check the pcre section of your phpinfo() output. This is the one on my system:

...
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19
...
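
The same information can be read from a script, along with a quick probe of what the library can do. This is a sketch under an assumption: one possible explanation for the behaviour in the question is a PCRE build without Unicode property support, in which case /i would fold case only for ASCII letters even with /u. The variable names below are made up for illustration.

```php
<?php
// PCRE_VERSION is a built-in constant with the library version and release date.
echo 'PCRE version: ', PCRE_VERSION, "\n";

// Probe 1: does this PCRE build know Unicode properties?
// If not, \p{Lu} fails to compile and preg_match() returns false
// (the @ only hides the resulting warning).
$hasProperties = @preg_match('/\p{Lu}/u', 'В') === 1;

// Probe 2: does /i really fold case for a Cyrillic letter under /u?
$foldsCyrillic = @preg_match('/в/iu', 'В') === 1;

echo 'Unicode properties:    ', $hasProperties ? 'yes' : 'no', "\n";
echo 'Cyrillic case folding: ', $foldsCyrillic ? 'yes' : 'no', "\n";
```

If both probes print "no" on the CentOS box and "yes" on a machine where the replacement works, the difference is most likely in the PCRE build rather than in the PHP code.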