Question:
I have a PHP 5.3 script that displays the users of my web site, and I would like to replace a certain Russian city (stored as UTF-8 in a PostgreSQL 8.4.7 database on 64-bit CentOS 5.5 Linux) with its older name (it is an inside joke):
preg_replace('/Волгоград/iu', 'Сталинград', $city);
Unfortunately this only works for exact matches: Волгоград.
This does not work for other cases, like ВОЛГОГРАД or волгоград.
If I modify my source code to
preg_replace('/[Вв]олгоград/iu', 'Сталинград', $city);
then it will catch the 2nd case above.
Does anybody know what is going on and how to fix it (assuming I don't want to write [Xx] for every letter)?
Thank you!
Alex
UPDATE:
# rpm -qa|grep php
php53-bcmath-5.3.3-1.el5
php53-gd-5.3.3-1.el5
php53-common-5.3.3-1.el5
php53-pdo-5.3.3-1.el5
php53-mbstring-5.3.3-1.el5
php53-xml-5.3.3-1.el5
php53-5.3.3-1.el5
php53-cli-5.3.3-1.el5
php53-pgsql-5.3.3-1.el5
# rpm -qa|grep pcre
pcre-6.6-2.el5_1.7
Answer 1:
I cannot reproduce your issue with PHP 5.3.3 (PHP 5.3.3-1ubuntu9.3 with Suhosin-Patch (cli)):
$str1 = 'Волгоград';
$str2 = 'ВОЛГОГРАД';
$str3 = 'волгоград';
var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str1));
var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str2));
var_dump(preg_replace('/Волгоград/iu', 'Сталинград', $str3));
outputs
string(20) "Сталинград"
string(20) "Сталинград"
string(20) "Сталинград"
Which PCRE version is your PHP using? Check the pcre section of your phpinfo() output. This is the one on my system:
...
pcre
PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19
...
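To compare PCRE versions without digging through phpinfo(), a short script along these lines may help. It is only a sketch: pcre_folds_cyrillic_case() is a hypothetical helper name, not part of PHP, and the script assumes the file itself is saved as UTF-8. Older PCRE builds compiled without Unicode property support may fold case only for ASCII letters under /iu, which would be consistent with the behaviour described in the question.

```php
<?php
// Print the version of the PCRE library this PHP build is linked against.
echo 'PCRE version: ', PCRE_VERSION, "\n";

// Hypothetical helper (not from the original post): returns true when the
// /iu modifiers fold case for Cyrillic letters on this build.
function pcre_folds_cyrillic_case()
{
    // @ hides the warning raised if this PCRE build lacks UTF-8 support.
    return @preg_match('/волгоград/iu', 'ВОЛГОГРАД') === 1;
}

var_dump(pcre_folds_cyrillic_case());
```

If the helper reports false, the pattern from the question cannot match ВОЛГОГРАД on that machine regardless of how the PHP code is written.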
Answer 2:
You can skip the regex, it worked for me in PHP 5.2.11 :)
$city = 'Unfortunately this only works for exact matches: Волгоград.
This does not work for other cases, like ВОЛГОГРАД or волгоград.';
echo str_ireplace('Волгоград', '[found]', $city);
Output
"Unfortunately this only works for exact matches: [found].
This does not work for other cases, like [found] or [found]."
This intrigued me, so I asked a question.
Answer 3:
This one solved the problem:
setlocale(LC_ALL, 'ru_RU.CP1251', 'rus_RUS.CP1251', 'Russian_Russia.1251');
Answer 4:
I copy+pasted your capital В. It is indeed the Cyrillic letter U+0412 (UTF-8 bytes D0 92), not the normal Latin B. But since they look so much alike (В vs. B), I suspect the Russian letter is being collated onto the Latin B, U+0042.
So either PHP is preprocessing it, or maybe PCRE is somewhat inexact there too. Check the output of print PCRE_VERSION; and have a look at the changelog.
Anyway, to evade the problem I would suggest you only use the lowercase letters. They are more likely to be distinct from the Latin alphabet.
preg_replace('/волгоград/iu', 'Сталинград', $city);
P.S.: Evil inside joke!
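To see for yourself which letter you are dealing with, you can dump its bytes and code point. This is a small sketch that assumes PHP 7.2 or later (for mb_ord()) with the mbstring extension loaded, and a source file saved as UTF-8; the variable names are only for illustration.

```php
<?php
$cyrillic = 'В'; // Cyrillic capital Ve, as used in the question
$latin    = 'B'; // Latin capital B

// Raw bytes of each letter as stored in a UTF-8 source file.
echo bin2hex($cyrillic), "\n"; // d092
echo bin2hex($latin), "\n";    // 42

// Unicode code points of each letter.
printf("U+%04X\n", mb_ord($cyrillic, 'UTF-8')); // U+0412
printf("U+%04X\n", mb_ord($latin, 'UTF-8'));    // U+0042
```

The two letters share neither bytes nor a code point, so any confusion between them would have to come from the matching library and not from the data.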
Answer 5:
Works like a charm on my box...
<?php
$city = 'Волгоград';
var_dump(preg_match('/волгоград/ui', $city));
var_dump(preg_match('/ВОЛГОГРАД/ui', $city));
var_dump(preg_replace('/волгоград/ui', 'Сталинград', $city));
var_dump(preg_replace('/ВОЛГОГРАД/ui', 'Сталинград', $city));
Output:
int 1
int 1
string 'Сталинград' (length=20)
string 'Сталинград' (length=20)
Are you sure that input data ($city) is in UTF8?
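One way to check that, sketched here under the assumption that the mbstring and iconv extensions are available: mb_check_encoding() reports whether the bytes are valid UTF-8, and preg_match() with the /u modifier returns false (not 0) when the subject is not valid UTF-8. The $city value is a stand-in for whatever comes back from the database, and $cp1251 is a deliberately mis-encoded copy for comparison.

```php
<?php
// Stand-in for the value read from the database (an assumption).
$city = 'Волгоград';

// True only when the bytes form a valid UTF-8 sequence.
var_dump(mb_check_encoding($city, 'UTF-8'));

// The same text re-encoded as Windows-1251 is no longer valid UTF-8.
$cp1251 = iconv('UTF-8', 'Windows-1251', $city);
var_dump(mb_check_encoding($cp1251, 'UTF-8'));

// With /u, preg_match() returns false for an invalid UTF-8 subject.
var_dump(preg_match('/./u', $cp1251));
```

If the real $city fails the first check, the database connection encoding is worth a look before blaming the regex.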
Answer 6:
Perhaps try: mb_eregi_replace
http://www.php.net/manual/en/function.mb-eregi-replace.php
mb_eregi_replace — Replace regular expression with multibyte support ignoring case
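A minimal sketch of how that might look for the city name in the question, assuming the mbstring extension is loaded, the source file is saved as UTF-8, and the mbstring regex engine on your build folds Cyrillic case. Note that mb_eregi_replace() takes the pattern without delimiters or modifiers.

```php
<?php
// Tell the mbstring regex functions to treat patterns and subjects as UTF-8.
mb_regex_encoding('UTF-8');

// Stand-in for the value read from the database (an assumption).
$city = 'ВОЛГОГРАД';

// Case-insensitive replacement; no /.../ delimiters are used here.
$result = mb_eregi_replace('волгоград', 'Сталинград', $city);

var_dump($result);
```

This sidesteps PCRE entirely, so it may be worth trying on systems where the PCRE library is old.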
Answer 7:
Just guessing, but explicitly encoding the string to unicode may help:
preg_replace('/Волгоград/iu', utf8_encode('Сталинград'), $city);
Answer 8:
Actually, with PHP 5.2.x on Windows, the accepted answer did not work for me.
I had to convert everything to Windows-1251 to make it work.
Here is an example:
$new_content = preg_replace(iconv('UTF-8', 'Windows-1251', "/\bгъз\b/i"), iconv('UTF-8', 'Windows-1251', "YYYYYY"), iconv('UTF-8', 'Windows-1251', "ти си gyz gyz гъз ГЪЗ gyzgyz гЪз gyz"));
$new_content = iconv('Windows-1251', 'UTF-8', $new_content);
The example above will successfully replace 'гъз' with YYYYYY (case-insensitively) and give you back the UTF-8 version.
Regards!
Answer 9:
For those who support a huge legacy code base, are struggling with charset and encoding issues, and have no option to convert the code's charset, here's an answer:
//for
setlocale(LC_ALL, 'ru_RU.cp1251');
//(or any other locale) to take effect,
//you MUST generate system locale, i.e.
sudo su
#view supported locales
#less /usr/share/i18n/SUPPORTED
echo "ru_RU.cp1251 CP1251" >> /var/lib/locales/supported.d/local
dpkg-reconfigure locales
exit
#and (for ubuntu/debian)
apt-get install php5-intl
While you can rewrite your regexp to use some UTF tricks or convert your code to UTF, that is not an option when you work with a huge codebase, database, etc.