Question:

I'm trying to search a UTF-8-encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems the subject is not being treated as a UTF-8-encoded string, even though I'm passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF8 functions are working:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

Answer 1:

Looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391

'u' switch only makes sense for pcre, PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I don't say "correct").
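
To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that mbstring.func_overload is not enabled, so strlen really counts bytes.

```php
<?php
// "¡" (U+00A1) is encoded in UTF-8 as the two bytes 0xC2 0xA1.
$str = "\xC2\xA1Hola!";

$byteLength = strlen($str);              // counts bytes
$charLength = mb_strlen($str, 'UTF-8');  // counts characters (code points)

// The u modifier changes how the pattern and subject are interpreted,
// but the captured offset is still a byte offset.
preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

echo "bytes: $byteLength, characters: $charLength, offset of H: $byteOffset\n";
```

The two-byte "¡" in front of "H" is what pushes the reported offset from 1 to 2.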

Answer 2:

Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
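
The same conversion can be wrapped in a function and applied to every match from preg_match_all. This is only a sketch: byteOffsetToCharOffset is a hypothetical helper name, the subject is assumed to be valid UTF-8, and mbstring.func_overload is assumed to be off so that substr cuts by bytes.

```php
<?php
/**
 * Hypothetical helper: converts a byte offset reported by PCRE into a
 * character offset, assuming $subject is valid UTF-8.
 */
function byteOffsetToCharOffset(string $subject, int $byteOffset): int
{
    // Count the characters in the bytes that precede the match.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

$str = "\xC2\xA1Hola, se\xC3\xB1or!";   // "¡Hola, señor!"
preg_match_all('/[Ho]/u', $str, $matches, PREG_OFFSET_CAPTURE);

$charOffsets = [];
foreach ($matches[0] as [$text, $byteOffset]) {
    $charOffsets[] = byteOffsetToCharOffset($str, $byteOffset);
}

print_r($charOffsets);
```

Each call rescans the subject from the start, so for many matches in a long string a precomputed offset map may be the better fit.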

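Answer 3:

Try adding (*UTF8) at the start of the regex, inside the delimiters:

preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment in http://www.php.net/manual/es/function.preg-match.php#95828

Note: the verb has to be the first thing inside the delimiters. Written before the opening slash, the parentheses are taken as the delimiters and the call fails. As far as I can tell, the verb only switches PCRE into UTF-8 mode, much like the u modifier, so the offsets returned with PREG_OFFSET_CAPTURE are most likely still byte offsets.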

Answer 4:

Excuse the necropost, but somebody may find this useful: the code below can stand in for both preg_match and preg_match_all, and it returns the matches with correct character offsets for UTF-8-encoded strings.

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= \'_all\';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
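                foreach ($matches as $match) {
                    $matchedText = $match[0];
                    $byteOffset  = $match[1]; // byte offset reported by PCRE
                    $positions[] = array(
                        $matchedText,
                        // characters preceding the match; -1 marks an unmatched group
                        $byteOffset < 0 ? -1 : mb_strlen(mb_strcut($subject, 0, $byteOffset))
                    );
                }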
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

Answer 5:

If all you want is the multi-byte-safe position of "H", try mb_strpos():

mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";

Output:

¡Hola!
1
H
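
mb_strpos only finds literal substrings, not patterns, but its third argument (a start offset counted in characters) lets you collect every occurrence. A rough sketch, where mbFindAll is a hypothetical helper name and the input is assumed to be valid UTF-8:

```php
<?php
/**
 * Hypothetical helper: returns the character offsets of every occurrence
 * of a literal $needle in $haystack.
 */
function mbFindAll(string $haystack, string $needle): array
{
    if ($needle === '') {
        return [];   // nothing sensible to search for
    }

    $positions = [];
    $offset = 0;
    while (($pos = mb_strpos($haystack, $needle, $offset, 'UTF-8')) !== false) {
        $positions[] = $pos;
        $offset = $pos + 1;   // resume the search one character further on
    }

    return $positions;
}

$positions = mbFindAll("\xC2\xA1Hola, \xC2\xA1Hola!", 'H');   // "¡Hola, ¡Hola!"
print_r($positions);
```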

Answer 6:

I wrote a small class that converts the byte offsets returned by preg_match into proper UTF-8 character offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like this:

$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

https://3v4l.org/8Y32J
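
A map like this can also be built in a single pass over the bytes, which may matter for long inputs. The sketch below is a variation on the idea, not the class above: buildByteToCharMap is a hypothetical name, the content is assumed to be valid UTF-8, and strlen is assumed to count bytes (no mbstring.func_overload).

```php
<?php
/**
 * Hypothetical helper: maps every byte offset of a valid UTF-8 string to
 * the index of the character that byte belongs to.
 */
function buildByteToCharMap(string $content): array
{
    $map = [];
    $charIndex = -1;
    $length = strlen($content);

    for ($i = 0; $i < $length; $i++) {
        // Bytes of the form 10xxxxxx continue the previous character;
        // any other byte starts a new one.
        if ((ord($content[$i]) & 0xC0) !== 0x80) {
            $charIndex++;
        }
        $map[$i] = $charIndex;
    }

    // Also map the end-of-string offset, e.g. for an empty match at the end.
    $map[$length] = $charIndex + 1;

    return $map;
}

$content = "a\xC4\x85 ba\xC4\x87 d";   // "aą bać d"
$map = buildByteToCharMap($content);

preg_match("#ba\xC4\x87#u", $content, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];
$charOffset = $map[$byteOffset];

echo mb_substr($content, $charOffset, 3, 'UTF-8'), "\n";
```

Unlike the class above, this map includes an entry for the offset just past the last byte, so an empty match at the end of the string can be converted too.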

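Answer 7:

You might want to look at the T-Regx library.

pattern('H', 'u')->match("\xC2\xA1Hola!")->first(function (Match $match)
{
    echo $match->offset();
});

Here $match->offset() returns a UTF-8-safe offset, counted in characters rather than bytes. Note that the subject must be in double quotes for the \x escapes to be interpreted.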

preg_match and UTF-8 in PHP