Consequences of Inserting Positive Lookbehind into

Posted 2019-06-15 18:14

Question:

What would be the consequences of inserting a positive lookbehind for n-bytes, (?<=\C{n}), into the beginning of any arbitrary regular expression, particularly when used for replacement operations?

At least within PHP, the regex match functions, preg_match and preg_match_all, allow matching to begin after a given byte offset. None of the other PCRE functions in PHP has a corresponding feature: preg_replace, for instance, lets you limit the number of replacements, but not require that the replaced matches occur after n bytes.

There would obviously be some (let's call them trivial) consequences for performance and readability, but would there be any non-trivial impacts, like matches becoming non-matches (other than those that start before the n-byte offset) or replacements becoming malformed?

Some examples:

/some expression/ becomes /(?<=\C{4})some expression/ for a 4-byte offset

/(this) has (groups)/i becomes /(?<=\C{2})(this) has (groups)/i for a 2-byte offset

As far as I can tell, and from the limited tests that I've run, adding in this lookbehind effectively simulates this offset parameter and doesn't mess with any other lookbehinds, substitutions, or other control patterns; but I'm also not an expert on Regex.

I'm trying to determine if there are any likely consequences to building replace/filter function extensions by inserting the n-byte lookbehind into patterns. It should operate just as the match functions' offset parameter works - so simply running the expression against substr( $subject, $offset ) won't work for the same reasons it doesn't for preg_match (most notably it cuts off any lookbehinds and ^ then incorrectly matches the start of the substring, not the original string).
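To make that difference concrete, here is a minimal sketch, assuming PHP with the 8-bit PCRE library and no u flag. The subject string and patterns are invented for illustration, and (?s:.) is used as the byte matcher in place of \C.

```php
<?php
$subject = 'xxabc';
$offset  = 2;

// 1. substr() throws away the context before the offset:
//    ^ matches at the cut, and (?<=x) has nothing to look back at.
$substrAnchor     = preg_match('/^abc/', substr($subject, $offset));
$substrLookbehind = preg_match('/(?<=x)abc/', substr($subject, $offset));

// 2. The offset parameter of preg_match() keeps the whole subject visible.
$nativeAnchor     = preg_match('/^abc/', $subject, $m, 0, $offset);
$nativeLookbehind = preg_match('/(?<=x)abc/', $subject, $m, 0, $offset);

// 3. The inserted 2-byte lookbehind, with no offset parameter at all.
$simAnchor     = preg_match('/(?<=(?s:.){2})^abc/', $subject);
$simLookbehind = preg_match('/(?<=(?s:.){2})(?<=x)abc/', $subject);
```

In this sketch the lookbehind variant agrees with the offset parameter in both cases, while the substr() variant gets both wrong.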

Answer 1:

Short answer

In non-UTF mode, UTF-8 library

Assuming the PCRE library bundled with PHP is compiled as the 8-bit library (UTF-8), then in non-UTF mode

\C

is equivalent to

[\x00-\xff]

and

(?s:.)

Any of them can be used in a look-behind as a replacement for the offset parameter of the preg_match and preg_match_all functions.

In non-UTF mode, all of them match exactly 1 data unit, which is 1 byte in the 8-bit (UTF-8) PCRE library, and they match all 256 possible byte values.
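A quick way to check that claim, sketched under the assumption of non-UTF mode on the 8-bit library. The subject is made up and deliberately starts with a newline and bytes above 0x7F. Only (?s:.) and [\x00-\xff] are exercised here, on the expectation that \C behaves the same way wherever it is enabled.

```php
<?php
// Bytes 0-3 are "\n", 0xFF, 0x00, 0x80; "foo" occurs at byte offsets 4 and 8.
$subject = "\n\xFF\x00\x80foo foo";

$variants = [
    'dotall' => '/(?<=(?s:.){5})foo/',
    'class'  => '/(?<=[\x00-\xff]{5})foo/',
];

$positions = [];
foreach ($variants as $name => $pattern) {
    preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    $positions[$name] = $m[0][1]; // byte offset of the match
}

// The native offset parameter, for comparison.
preg_match('/foo/', $subject, $m, PREG_OFFSET_CAPTURE, 5);
$positions['native'] = $m[0][1];
```

A plain . without the s modifier would not do here, because it refuses to match the leading newline.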

In UTF-mode, UTF-8 library

UTF mode can be activated by the u flag in the pattern passed to a preg_* function, or by specifying one of the (*UTF), (*UTF8), (*UTF16), (*UTF32) verbs at the beginning of the pattern.

In UTF mode, a character class [] and the dot metacharacter . each match one code point that lies within the valid Unicode range and is not a surrogate. Since one code point is encoded as 1 to 4 bytes in UTF-8, and because of how the UTF-8 encoding scheme works, a character class cannot be used to match a single byte with a value in the range 0x80 to 0xFF.

While \C is specifically designed to match one data unit (which is one byte in UTF-8) regardless of whether UTF mode is on or not, it is not supported in a look-behind construct in UTF mode.
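The following sketch illustrates both points, assuming the 8-bit library. The two-character subject is made up. The compile failure is suppressed with @ so that only the return value is inspected.

```php
<?php
// "é" followed by "b": é is two bytes in UTF-8, so "b" sits at byte offset 2.
$subject = "\xC3\xA9b";

// Non-UTF mode: two bytes precede "b", so a 2-byte lookbehind succeeds.
$bytesMode = preg_match('/(?<=(?s:.){2})b/', $subject);

// UTF mode: . now consumes one code point, and only one precedes "b".
$utfDot = preg_match('/(?<=.{2})b/u', $subject);

// UTF mode with \C inside a lookbehind: the pattern fails to compile,
// and preg_match() returns false.
$utfBackslashC = @preg_match('/(?<=\C{2})b/u', $subject);
```

So adding the u flag to a pattern that already carries the byte lookbehind either changes what n counts or breaks the pattern outright.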

UTF-16 and UTF-32 library

I don't know whether anyone actually compiles the 16-bit or 32-bit PCRE library, builds PHP against it, and gets it to work. If anyone knows of such a build being widely used in the wild, please ping me. I have no idea how PHP would pass the string and the offset to the PCRE C API in that case, and the results of the preg_* functions may differ depending on that.

More details

At the C API level of the PCRE library, you can only work with data units, which are 8-bit units for the 8-bit library, 16-bit units for the 16-bit library, and 32-bit units for the 32-bit library.

For the 8-bit library (UTF-8), 1 data unit is 8 bits, or 1 byte, so there is little obstacle to specifying an offset in bytes, whether as a function parameter or as a regex construct.

Regex constructs

In non-UTF mode, a character class [], the dot . and \C each match exactly 1 data unit.

  • \C matches 1 data unit regardless of whether UTF mode is on. It can't be used in a look-behind in UTF mode, though.

    MATCHING A SINGLE DATA UNIT

    Outside a character class, the escape sequence \C matches any one data unit, whether or not a UTF mode is set.

  • . matches 1 data unit in non-UTF mode.

    General comments about UTF modes

    [...]

    1. The dot metacharacter matches one UTF character instead of a single data unit.

  • A character class matches 1 data unit in non-UTF mode. The documentation doesn't state this explicitly, but the wording implies it.

    SQUARE BRACKETS AND CHARACTER CLASSES

    [...]

    A character class matches a single character in the subject. In a UTF mode, the character may be more than one data unit long.

    The same conclusion can be reached by looking at the upper limit of the \x{hh...} syntax for specifying a character by hex code in non-UTF mode (quoted below). In my testing, the last clause of that quote, about surrogates, doesn't seem to apply in non-UTF mode.

    Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:

     8-bit non-UTF mode    less than 0x100
     8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
     16-bit non-UTF mode   less than 0x10000
     16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
     32-bit non-UTF mode   less than 0x100000000
     32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
    

    Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.
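The first two rows of that table can be probed from PHP with a sketch like the one below, assuming the 8-bit library. The test subjects are made up, and @ hides the compile warning so that only the return value is checked.

```php
<?php
// 8-bit non-UTF mode: values below 0x100 are accepted and match one byte.
$okByte = preg_match('/\x{ff}/', "\xFF");

// 8-bit non-UTF mode: 0x100 is over the limit, so the pattern does not
// compile and preg_match() returns false.
$tooLarge = @preg_match('/\x{100}/', "\xC4\x80");

// UTF mode: the same escape is a valid code point (U+0100, bytes C4 80).
$okUtf = preg_match('/\x{100}/u', "\xC4\x80");
```

The 0x100 ceiling in non-UTF mode is consistent with a character class there covering exactly one byte.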

Offset

All offsets supplied and returned are counted in data units:

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a length in length, and a starting offset in startoffset. The units for length and startoffset are bytes for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit data items for the 32-bit library.

How pcre_exec() returns captured substrings

[...]

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. These values are always data unit offsets, even in UTF mode.
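On the PHP side the same byte-based counting shows up in PREG_OFFSET_CAPTURE and in the offset parameter. A small sketch, assuming the 8-bit library and a made-up subject:

```php
<?php
// "é" (two bytes in UTF-8) followed by "b".
$subject = "\xC3\xA9b";

// Even with the u flag, the reported offset counts bytes, not characters.
preg_match('/b/u', $subject, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

// The offset parameter is in bytes as well: starting at byte 2 still finds "b".
$foundFromOffset = preg_match('/b/u', $subject, $m2, 0, $byteOffset);

// The byte offset can be fed straight back into byte-oriented string functions.
$tail = substr($subject, $byteOffset);
```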



Answer 2:

You could try /(?<=[\x00-\xFF]{n})some expression/ for an n-byte offset. Add anchors, or some other soft anchors, to handle the start alignment.
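Wrapped up as a function, that suggestion might look like the sketch below. The helper name replace_after_offset and its parameter layout are hypothetical, not part of PHP. It assumes non-UTF mode and a pattern body that contains no unescaped "/" and no leading (*...) verb.

```php
<?php
// Hypothetical helper: preg_replace() restricted to matches that start at
// or after $offset bytes. $body is the pattern without delimiters or flags.
function replace_after_offset(
    string $body,
    string $flags,
    string $replacement,
    string $subject,
    int $offset
): ?string {
    // No lookbehind is needed when the offset is zero.
    $guard = $offset > 0 ? '(?<=[\x00-\xFF]{' . $offset . '})' : '';

    // The non-capturing group keeps a top-level alternation such as a|b
    // inside the scope of the lookbehind without renumbering capture groups.
    $pattern = '/' . $guard . '(?:' . $body . ')/' . $flags;

    return preg_replace($pattern, $replacement, $subject);
}

// The first "This has groups" starts at byte 0 and is left alone.
$swapped = replace_after_offset(
    '(this) has (groups)', 'i', '$2 has $1',
    'This has groups; this has groups', 2
);

// Both alternatives respect the offset thanks to the wrapping group.
$alternation = replace_after_offset('a|b', '', 'X', 'abab', 2);
```

The wrapping group is a design choice of this sketch: without it, the lookbehind in /(?<=..)a|b/ would bind only to the first alternative.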