PHP: Split multibyte string (word) into separate c

2020-02-04 11:03发布

Trying to split this string "主楼怎么走" into separate characters (I need an array) using mb_split with no luck... Any suggestions?

Thank you!

4条回答
Deceive 欺骗
2楼-- · 2020-02-04 11:38

try a regular expression with 'u' option, for example

  $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
查看更多
够拽才男人
3楼-- · 2020-02-04 11:47

An ugly way to do it is:

mb_internal_encoding("UTF-8"); // this IS A MUST!! PHP has trouble with multibyte
                               // when no internal encoding is set!
$string = ".....";
$chars = array();
for ($i = 0; $i < mb_strlen($string); $i++ ) {
    $chars[] = mb_substr($string, $i, 1); // only one char to go to the array
}

You should also try your way with mb_split with setting the internal_encoding before it.

查看更多
仙女界的扛把子
4楼-- · 2020-02-04 11:52

You can use grapheme functions (PHP 5.3 or intl 1.0) and IntlBreakIterator (PHP 5.5 or intl 3.0). The following code shows the diffrence among intl and mbstring and PCRE functions.

// http://www.php.net/manual/function.grapheme-strlen.php
$string = "a\xCC\x8A"  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5)
         ."o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS'  (U+00F6)

$expected = ["a\xCC\x8A", "o\xCC\x88"];
$expected2 = ["a", "\xCC\x8A", "o", "\xCC\x88"];

var_dump(
    $expected === str_to_array($string),
    $expected === str_to_array2($string),
    $expected2 === str_to_array3($string),
    $expected2 === str_to_array4($string),
    $expected2 ===  str_to_array5($string)
);

function str_to_array($string)
{
    $length = grapheme_strlen($string);
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = grapheme_substr($string, $i, 1);
    }

    return $ret;
}

function str_to_array2($string)
{
    $it = IntlBreakIterator::createCharacterInstance('en_US');
    $it->setText($string);

    $ret = [];
    $prev = 0;

    foreach ($it as $pos) {

        $char = substr($string, $prev, $pos - $prev);

        if ('' !== $char) {
           $ret[] = $char;
        }

        $prev = $pos;
    }

    return $ret;
}

function str_to_array3($string)
{
    $it = IntlBreakIterator::createCodePointInstance();
    $it->setText($string);

    $ret = [];
    $prev = 0;

    foreach ($it as $pos) {

        $char = substr($string, $prev, $pos - $prev);

        if ('' !== $char) {
           $ret[] = $char;
        }

        $prev = $pos;
    }

    return $ret;
}

function str_to_array4($string)
{
    $length = mb_strlen($string, "UTF-8");
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = mb_substr($string, $i, 1, "UTF-8");
    }

    return $ret;
}

function str_to_array5($string) {
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

When working on production environment, you need to replace invalid byte sequence with the substitute character since almost all grapheme and mbstring functions can't handle invalid byte sequence. If you have an interest, see my past answer: https://stackoverflow.com/a/13695364/531320

If you don't take of perfomance, htmlspecialchars and htmlspecialchars_decode can be used. The merit of this way is supporting various encoding other than UTF-8.

function str_to_array6($string, $encoding = 'UTF-8')
{
    $ret = [];
    str_replace_callback($string, function($char, $index) use (&$ret) { $ret[] = $char; return ''; }, $encoding);
    return $ret;
}

function str_replace_callback($string, $callable, $encoding = 'UTF-8')
{
    $str_size = strlen($string);
    $string = str_scrub($string, $encoding);

    $ret = '';
    $char = '';
    $index = 0;

    for ($pos = 0; $pos < $str_size; ++$pos) {

        $char .= $string[$pos];

        if (str_check_encoding($char, $encoding)) {

            $ret .= $callable($char, $index);
            $char = '';
            ++$index;
        }

    }

    return $ret;
}

function str_check_encoding($string, $encoding = 'UTF-8')
{
    $string = (string) $string;
    return $string === htmlspecialchars_decode(htmlspecialchars($string, ENT_QUOTES, $encoding));
}

function str_scrub($string, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($string, ENT_SUBSTITUTE, $encoding));
}

If you want to learn the specification of UTF-8, the byte manipulation is the good way to practice.

function str_to_array6($string)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $size = strlen($string);
    $ret = [];

    for ($i = 0; $i < $size; $i += 1) {

        if ($string[$i] <= "\x7F") {

            $ret[] = $string[$i];

        } elseif ("\xC2" <= $string[$i] && $string[$i] <= "\xDF")  {

            if (!isset($string[$i+1])) {

                $ret[] = $substitute;
                return $ret;

            } elseif ($string[$i+1] < "\x80" || "\xBF" < $string[$i+1]) {

                $ret[] = $substitute;

            } else {

                $ret[] = substr($string, $i, 2);
                $i += 1;

            }

        } elseif ("\xE0" <= $string[$i] && $string[$i] <= "\xEF") {

            $left = "\xE0" === $string[$i] ? "\xA0" : "\x80";
            $right = "\xED" === $string[$i] ? "\x9F" : "\xBF";

            if (!isset($string[$i+1])) {

                $ret[] = $substitute;
                return $ret;

            } elseif ($string[$i+1] < $left || $right < $string[$i+1]) {

                $ret[] = $substitute;

            } else {

                if (!isset($string[$i+2])) {

                    $ret[] = $substitute;
                    return $ret;

                } elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) {

                    $ret[] = $substitute;
                    $i += 1;

                } else {

                    $ret[] = substr($string, $i, 3);
                    $i += 2;

                }

            }

        } elseif ("\xF0" <= $string[$i] && $string[$i] <= "\xF4") {

            $left = "\xF0" === $string[$i] ? "\x90" : "\x80";
            $right = "\xF4" === $string[$i] ? "\x8F" : "\xBF";

            if (!isset($string[$i+1])) {

                $ret[] = $substitute;
                return $ret;

            } elseif ($string[$i+1] < $left || $right < $string[$i+1]) {

                $ret[] = $substitute;

            } else {

                if (!isset($string[$i+2])) {

                    $ret[] = $substitute;
                    return $ret;

                } elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) {

                    $ret[] = $substitute;
                    $i += 1;

                } else {

                    if (!isset($string[$i+3])) {

                        $ret[] = $substitute;
                        return $ret;

                    } elseif ($string[$i+3] < "\x80" || "\xBF" < $string[$i+3]) {

                        $ret[] = $substitute;
                        $i += 2;

                    } else {

                        $ret[] = substr($string, $i, 4);
                        $i += 3;

                    }

                }

            }

        } else {

            $ret[] = $substitute;

        }

    }

    return $ret;

}

The result of benchmark between these functions is here.

grapheme
0.12967610359192
IntlBreakIterator::createCharacterInstance
0.17032408714294
IntlBreakIterator::createCodePointInstance
0.079245090484619
mbstring
0.081080913543701
preg_split
0.043133974075317
htmlspecialchars
0.25599694252014
byte maniplulation
0.13132810592651

The benchmark code is here.

$string = '主楼怎么走';

foreach (timer([
    'grapheme' => 'str_to_array',
    'IntlBreakIterator::createCharacterInstance' => 'str_to_array2',
    'IntlBreakIterator::createCodePointInstance' => 'str_to_array3',
    'mbstring' => 'str_to_array4',
    'preg_split' => 'str_to_array5',
    'htmlspecialchars' => 'str_to_array6',
    'byte maniplulation' => 'str_to_array7'
],
[$string]) as $desc => $time) {

  echo $desc, PHP_EOL,
       $time, PHP_EOL; 
}

function timer(array $callables, array $arguments, $repeat = 10000) {

    $ret = [];
    $save = $repeat;

    foreach ($callables as $key => $callable) {

        $start = microtime(true);

        do {

            array_map($callable, $arguments);

        } while($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;

    }

    return $ret;
}
查看更多
何必那么认真
5楼-- · 2020-02-04 11:57

Assuming you have set the desired encoding and regular expression encoding for the MB functions (such as to UTF-8), you could use my method from my String class library.

/**
 * Splits a string into pieces (on whitespace by default).
 ^
 * @param string $pattern
 * @param string $target
 * @param int $limit
 * @return array
 */
public function split(string $target, string $pattern = '\s+', int $limit = -1): array
{
    return mb_split($pattern, $target, $limit);
}

By wrapping the mb_split() function in a method, I make it much easier to use. Simply invoke it with the desired value for the variable $pattern.

Remember, set the character encoding appropriately for your task.

mb_internal_encoding('UTF-8');  // For example.
mb_regex_encoding('UTF-8');     // For example.

In the case of my wrapper method, supply the empty string to the method like so.

$string = new String('UTF-8', 'UTF-8');  // Sets the internal and regex encodings.
$string->split($yourString, "")

In the direct PHP case ...

$characters = mb_split("", $string);
查看更多
登录 后发表回答