What would be the consequences of inserting a positive lookbehind for n bytes, (?<=\C{n}), at the beginning of any arbitrary regular expression, particularly when used for replacement operations?
At least within PHP, the regex match functions, preg_match and preg_match_all, allow matching to begin after a given byte offset. There is no corresponding feature in any of the other PHP PCRE functions: you can specify a limit on the number of replacements done by preg_replace, for instance, but not that those replacements' matches must occur after n bytes.
There would obviously be some (let's call them trivial) consequences for performance and readability, but would there be any non-trivial impacts, like matches becoming non-matches (other than those not offset by n bytes) or replacements becoming malformed?
Some examples:
/some expression/ becomes /(?<=\C{4})some expression/ for a 4-byte offset
/(this) has (groups)/i becomes /(?<=\C{2})(this) has (groups)/i for a 2-byte offset
As far as I can tell, and from the limited tests that I've run, adding in this lookbehind effectively simulates this offset parameter and doesn't mess with any other lookbehinds, substitutions, or other control patterns; but I'm also not an expert on Regex.
I'm trying to determine if there are any likely consequences to building replace/filter function extensions by inserting the n-byte lookbehind into patterns. It should operate just as the match functions' offset parameter works, so simply running the expression against substr( $subject, $offset ) won't work, for the same reasons it doesn't for preg_match (most notably, it cuts off any lookbehinds, and ^ then incorrectly matches the start of the substring, not the original string).
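As a rough sketch of what such an extension might look like: the helper name preg_replace_from is made up for illustration, and the code assumes a /-delimited pattern in non-UTF mode, with no leading (*...) verbs and no unescaped / in the modifiers. It wraps the original pattern body in (?:...) so that the lookbehind also applies to every branch of a top-level alternation.

```php
<?php
// Hypothetical helper (not part of PHP): replace only matches that start at
// or after $offset bytes into $subject. Assumes a '/'-delimited pattern
// without the u modifier.
function preg_replace_from(string $pattern, string $replacement, string $subject, int $offset): ?string
{
    $end   = strrpos($pattern, '/');          // position of the closing delimiter
    $body  = substr($pattern, 1, $end - 1);   // pattern text between the delimiters
    $flags = substr($pattern, $end + 1);      // trailing modifiers, e.g. "i"

    // Prepend the n-byte lookbehind; (?:...) keeps group numbering unchanged
    // and makes the lookbehind guard every branch of an alternation like a|b.
    $patched = '/(?<=\C{' . $offset . '})(?:' . $body . ')/' . $flags;

    return preg_replace($patched, $replacement, $subject);
}

$subject = 'abc abc abc';

// Only the matches starting at byte 4 or later are replaced.
echo preg_replace_from('/abc/', 'X', $subject, 4), "\n";

// ^ still means the start of the original string, so nothing is replaced...
echo preg_replace_from('/^abc/', 'X', $subject, 4), "\n";

// ...whereas the substr() approach lets ^ match at the cut point.
echo preg_replace('/^abc/', 'X', substr($subject, 4)), "\n";
```

The last two calls show the difference from the substr approach: with the lookbehind, ^ keeps its meaning relative to the original string.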
You could try /(?<=[\x00-\xFF]{n})some expression/ for an n-byte offset. Add anchors or some other soft anchors that do the start alignment.
Short answer
In non-UTF mode, UTF-8 library
Assuming the PCRE library bundled with PHP is compiled as the 8-bit library (UTF-8), then in non-UTF mode \C is equivalent to [\x00-\xFF] and (?s:.). Any of them can be used in a look-behind as a replacement for the offset parameter of the preg_match and preg_match_all functions.
In non-UTF mode, each of them matches 1 data unit, which is 1 byte in the 8-bit (UTF-8) PCRE library, and each matches all 256 possible values.
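The equivalence can be checked with a short script. This is only a sketch: the subject string and the 5-byte offset are arbitrary choices for illustration, and it assumes \C is enabled in the PCRE build that PHP uses.

```php
<?php
// "foo", newline, three bytes >= 0x80, then "foo" again at byte 7.
$subject = "foo\n\xFF\xC3\xA9foo";

// Three ways of writing "preceded by at least 5 bytes" in non-UTF mode.
$variants = [
    '/(?<=\C{5})foo/',
    '/(?<=[\x00-\xFF]{5})foo/',
    '/(?<=(?s:.){5})foo/',
];

$offsets = [];
foreach ($variants as $pattern) {
    preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    $offsets[] = $m[0][1];   // byte offset where the match starts
    echo $pattern, ' matched at byte ', $m[0][1], "\n";
}

// Reference: the built-in offset parameter of preg_match.
preg_match('/foo/', $subject, $m, PREG_OFFSET_CAPTURE, 5);
$reference = $m[0][1];
echo 'offset parameter matched at byte ', $reference, "\n";
```

The five bytes before the second foo include a newline and bytes above 0x7F, so each variant has to accept any byte value for the lookbehind to succeed there.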
In UTF-mode, UTF-8 library
UTF mode can be activated by the u flag in the pattern passed to a preg_* function, or by specifying the (*UTF), (*UTF8), (*UTF16), or (*UTF32) verbs at the beginning of the pattern.
In UTF mode, a character class [] and the dot metacharacter . match one code point that is within the valid Unicode range and is not a surrogate. Since one code point is encoded as 1 to 4 bytes in UTF-8, and due to the encoding scheme of UTF-8, it is not possible to use a character class to match a single byte for values in the range 0x80 to 0xFF.
While \C is specifically designed to match one data unit (which is one byte in UTF-8) regardless of whether UTF mode is on or not, it is not supported inside a look-behind in UTF mode.
UTF-16 and UTF-32 library
I don't know if anyone actually compiles the 16-bit or 32-bit PCRE library, includes it in PHP, and actually makes it work. If anyone knows of such a build being widely used in the wild, please ping me. I actually have no clue how the string and the offset from PHP are passed to the C API of PCRE, depending on which the results of the preg_* functions may differ.
More details
At the C API level of the PCRE library, you can only work with data units, which are 8-bit units for the 8-bit library, 16-bit units for the 16-bit library, and 32-bit units for the 32-bit library.
For the 8-bit library (UTF-8), 1 data unit is 8 bits, or 1 byte, so there is little barrier to specifying an offset in bytes, whether as a parameter to a function or as a regex construct.
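A small illustration of byte-based offsets, assuming the usual 8-bit PCRE build. The subject string is an arbitrary example.

```php
<?php
$subject = "\xC3\xA9b";   // "éb": é takes 2 bytes in UTF-8, b takes 1

// Returned offsets count bytes, even with the u modifier.
preg_match('/b/u', $subject, $m, PREG_OFFSET_CAPTURE);
$found = $m[0][1];
echo $found, "\n";

// The offset argument is a byte count too: 2 skips the whole "é".
preg_match('/./u', $subject, $m2, 0, 2);
$first = $m2[0];
echo $first, "\n";
```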
Regex constructs
In non-UTF mode, a character class [], the dot ., and \C each match exactly 1 data unit.
\C matches 1 data unit, regardless of whether UTF mode is on. It can't be used in a look-behind in UTF mode, though.
The dot . matches 1 data unit in non-UTF mode.
A character class matches 1 data unit in non-UTF mode. The documentation doesn't explicitly state this, but it's implied by the wording.
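These behaviours can be observed with a few calls. This is a sketch assuming the 8-bit library with \C enabled; the subject string is arbitrary, and the @ operator is used only to silence the compile warning from the last call.

```php
<?php
$subject = "\xC3\xA9b";   // "éb": é is the two bytes C3 A9

// Non-UTF mode: dot, a character class and \C each consume a single byte.
preg_match('/^./s', $subject, $dot);
preg_match('/^[\x00-\xFF]/', $subject, $cls);
preg_match('/^\C/', $subject, $c);

// UTF mode: dot and the same character class consume a whole code point,
// which is 2 bytes for é (U+00E9).
preg_match('/^./su', $subject, $dotUtf);
preg_match('/^[\x00-\xFF]/u', $subject, $clsUtf);

// UTF mode: \C inside a look-behind fails to compile, so preg_match
// returns false (and would emit a warning without the @).
$result = @preg_match('/(?<=\C)b/u', $subject);

var_dump(bin2hex($dot[0]), bin2hex($cls[0]), bin2hex($c[0]));
var_dump(strlen($dotUtf[0]), strlen($clsUtf[0]), $result);
```

In particular, [\x00-\xFF] inside a look-behind counts code points rather than bytes once the u modifier is present, so it no longer stands in for a byte offset.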
The same conclusion can be reached by looking at the upper limit of the \x{hh...} syntax for specifying a character by hex code in non-UTF mode. Through testing, the last clause about surrogates doesn't seem to apply to non-UTF mode.
Offset
All offsets supplied and returned are counted in data units: