PCRE in PHP: a concrete example of the usage and utility of the "S" modifier

Posted 2019-04-08 04:15

The PHP manual says the following about PCRE's "S" (extra analysis of pattern) modifier at http://php.net/manual/en/reference.pcre.pattern.modifiers.php:

S

When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

So it is meant for patterns that are used several times and that have neither anchors (such as ^ or $) nor a fixed starting character sequence; a pattern like '/^abc/', which has both, would not benefit.

But there are no specific details on where to apply this modifier or how it actually works.

Does it apply only to the PHP thread of the currently executing script, so that the "cached" analysis of the pattern is lost once the script finishes? Or does the engine store the analysis in a global cache that is then made available to every PHP thread using PCRE with a pattern marked with this modifier?

Also, from the PCRE introduction: http://php.net/manual/en/intro.pcre.php

Note: This extension maintains a global per-thread cache of compiled regular expressions (up to 4096)

If the "S" modifier is used per-thread only, how does it differ from the PCRE cache of compiled regexps? I guess additional information is stored, similar to what MySQL does when it indexes the rows of a table (in the case of PCRE, of course, this additional information is stored in memory).

Last but not least, has anyone come across a real use case for this modifier, and did you notice an improvement from using it?

Thanks for the attention.

1 answer
再贱就再见
Reply #2 · 2019-04-08 04:58

The PHP docs quote only a small part of the PCRE docs. Here are some more details (emphasis mine) from the PCRE 8.36 documentation:

If a compiled pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. The function pcre_study() takes a pointer to a compiled pattern as its first argument. If studying the pattern produces additional information that will help speed up matching, pcre_study() returns a pointer to a pcre_extra block, in which the study_data field points to the results of the study.

...

Studying a pattern does two things: first, a lower bound for the length of subject string that is needed to match the pattern is computed. This does not mean that there are any strings of that length that match, but it does guarantee that no shorter strings match. The value is used to avoid wasting time by trying to match strings that are shorter than the lower bound. You can find out the value in a calling program via the pcre_fullinfo() function.

Studying a pattern is also useful for non-anchored patterns that do not have a single fixed starting character. A bitmap of possible starting bytes is created. This speeds up finding a position in the subject at which to start matching. (In 16-bit mode, the bitmap is used for 16-bit values less than 256. In 32-bit mode, the bitmap is used for 32-bit values less than 256.)
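To make those two optimizations more tangible, here is a rough sketch of the idea in PHP. It is only an illustration, not how PCRE implements studying (that happens in C inside the library). The pattern, the function name studiedMatch(), and the constants MIN_LENGTH and START_BYTES are all invented for this example, and the two values were worked out by hand for the pattern /(?:cat|dog)\d+/.

```php
<?php
// Hypothetical illustration only: PCRE does this internally, in C.
// Pattern under discussion: /(?:cat|dog)\d+/
// (not anchored, and no single fixed starting character).

const MIN_LENGTH  = 4;    // "cat" or "dog" plus at least one digit (derived by hand)
const START_BYTES = 'cd'; // a match can only begin with "c" or "d" (derived by hand)

function studiedMatch(string $subject): ?string
{
    $length = strlen($subject);

    // 1. Lower bound: a shorter subject cannot match, so skip the matcher entirely.
    if ($length < MIN_LENGTH) {
        return null;
    }

    $offset = 0;
    while ($offset < $length) {
        // 2. Starting bytes: jump straight to the next byte that could begin a match.
        $offset += strcspn($subject, START_BYTES, $offset);
        if ($offset >= $length) {
            return null;
        }
        // \G pins the attempt to $offset, so only this one position is tried.
        if (preg_match('/\G(?:cat|dog)\d+/', $subject, $m, 0, $offset) === 1) {
            return $m[0];
        }
        $offset++;
    }

    return null;
}
```

For example, studiedMatch('my dog42 barks') skips directly to the "d" before attempting a match, and studiedMatch('cat') is rejected on length alone without running the regex at all.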

Please note that in the later PCRE release (v10.00, also called PCRE2), the library underwent a massive refactoring and API redesign. One consequence is that studying is always performed in PCRE 10.00 and above. I don't know when PHP will switch to PCRE2, but it will happen sooner or later because PCRE 8.x won't get any new features from now on.

Here's a quote from the PCRE2 release announcement:

Explicit "studying" of compiled patterns has been abolished - it now always happens automatically. JIT compiling is done by calling a new function, pcre2_jit_compile() after a successful return from pcre2_compile().
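For reference, this is roughly what using the modifier looks like from PHP. It is a minimal sketch with an invented pattern and invented log lines; firstMatches() is a hypothetical helper. The modifier can only affect speed, never the result, so both runs must return the same matches. On a PHP build that uses PCRE2, where studying is automatic, I would expect the "S" to make no difference at all, but that is an assumption worth checking against your version.

```php
<?php
// A non-anchored pattern with several possible first characters:
// the case the manual says can benefit from studying.
$plain   = '/(?:error|warning|notice): \w+/';
$studied = '/(?:error|warning|notice): \w+/S'; // same pattern plus the S modifier

// Hypothetical log lines, invented for this example.
$lines = [
    'boot ok',
    'disk warning: full',
    'net error: timeout',
];

// Hypothetical helper: collect the first match found in each line.
function firstMatches(string $pattern, array $lines): array
{
    $found = [];
    foreach ($lines as $line) {
        if (preg_match($pattern, $line, $m) === 1) {
            $found[] = $m[0];
        }
    }
    return $found;
}

$withoutS = firstMatches($plain, $lines);
$withS    = firstMatches($studied, $lines);
```

Any speed-up would only show when the same pattern is run against many subjects, so measure with your own data before relying on it.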


As for your second question:

If the "S" modifier is used per-thread only, how does it differ from the PCRE cache of compiled regexps?

There's no cache in PCRE itself, but PHP maintains a cache of regexes to avoid recompiling the same pattern over and over again, for instance in case you use a preg_ function inside a loop.
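A small sketch of the loop case mentioned above, with invented input data. As I understand it, the cache lookup is based on the pattern string, so passing the identical string on every iteration is what lets PHP reuse the compiled regex.

```php
<?php
// Hypothetical input, invented for this example.
$subjects = ['id=17', 'name=bob', 'id=42'];

$ids = [];
foreach ($subjects as $subject) {
    // The pattern string is identical on every iteration, so PHP can reuse
    // the compiled regex from its cache instead of compiling it again.
    if (preg_match('/id=(\d+)/', $subject, $m) === 1) {
        $ids[] = (int) $m[1];
    }
}
```

Nothing has to be done to enable this; the point is only that calling a preg_ function repeatedly with the same pattern does not pay the compilation cost each time.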
