I need some help regarding how to split Chinese characters mixed with English words and numbers in PHP.
For example, if I read
FrontPage 2000中文版應用大全
I'm hoping to get
FrontPage, 2000, 中,文,版,應,用,大,全
or
FrontPage, 2,0,0,0, 中,文,版,應,用,大,全
How can I achieve this?
Thanks in advance :)
Assuming you are using UTF-8 (or you can convert it to UTF-8 using Iconv or some other tools), then using the u
modifier (doc: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php )
<?
$s = "FrontPage 2000中文版應用大全";
print_r(preg_match_all('/./u', $s, $matches));
echo "\n";
print_r($matches);
?>
will give
21
Array
(
[0] => Array
(
[0] => F
[1] => r
[2] => o
[3] => n
[4] => t
[5] => P
[6] => a
[7] => g
[8] => e
[9] =>
[10] => 2
[11] => 0
[12] => 0
[13] => 0
[14] => 中
[15] => 文
[16] => 版
[17] => 應
[18] => 用
[19] => 大
[20] => 全
)
)
Note that my source code is stored in a file encoded in UTF-8 also, for the $s to contain those characters.
The following will match alphanumeric as a group:
<?
$s = "FrontPage 2000中文版應用大全";
print_r(preg_match_all('/(\w+)|(.)/u', $s, $matches));
echo "\n";
print_r($matches[0]);
?>
result:
10
Array
(
[0] => FrontPage
[1] =>
[2] => 2000
[3] => 中
[4] => 文
[5] => 版
[6] => 應
[7] => 用
[8] => 大
[9] => 全
)
With this code you can make chinese text (utf8) to wrap at the end of the line so that it is still readable
print_r(preg_match_all('/([\w]+)|(.)/u', $str, $matches));
$arr_result = array();
foreach ($matches[0] as $key => $val) {
$arr_result[]=$val;
$arr_result[]="​"; //add Zero-Width Space
}
foreach ($arr_result as $key => $val) {
$out .= $val;
}
return $out;
/**
* Reference: http://www.regular-expressions.info/unicode.html
* Korean: Hangul
* CJK: Han
* Japanese: Hiragana, Katakana
* Flag u required
*/
preg_match_all(
'/\p{Hangul}|\p{Hiragana}|\p{Han}|\p{Katakana}|(\p{Latin}+)|(\p{Cyrillic}+)/u',
$str,
$result
);
This one is the one works for me if you are using PHP 7.0 too.
This one is just not working. I regret to up vote a not working solution....
<?
$s = "FrontPage 2000中文版應用大全";
print_r(preg_match_all('/(\w+)|(.)/u', $s, $matches));
echo "\n";
print_r($matches[0]);
?>