Why are uncompiled, repeatedly used regexes so muc

2019-01-12 02:17发布

When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.

But then I thought I'd test it, comparing two scenarios:

  1. a single compilation of a simple regex, then 10 applications of that compiled regex.
  2. 10 applications of an uncompiled regex (where I would have expected slightly worse performance because the regex would have to be compiled once, then cached, and then looked up in the cache 9 times).

However, the results were staggering (in Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", 
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")')
18.547793477671938
>>> timeit.timeit(setup="import re", 
... stmt='for i in range(10):\n re.search(r"\w+","  jkdhf  ")')
106.47892003890324

That's over 5.7 times slower! In Python 2.7, there is still an increase by a factor of 2.5, which is also more than I would have expected.

Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.

1条回答
虎瘦雄心在
2楼-- · 2019-01-12 03:13

The code has changed.

In Python 2.7, the cache is a simple dictionary; if more than _MAXCACHE items are stored in it, the whole the cache is cleared before storing a new item. A cache lookup only takes building a simple key and testing the dictionary, see the 2.7 implementation of _compile()

In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work and includes a thread-lock, adjusting the cache LRU queue and maintaining the cache statistics (accessible via re._compile.cache_info()). See the 3.3 implementation of _compile() and of functools.lru_cache().

Others have noticed the same slowdown, and filed issue 16389 in the Python bugtracker. I'd expect 3.4 to be a lot faster again; either the lru_cache implementation is improved or the re module will move to a custom cache again.

Update: With revision 4b4ffffdd670d0 the cache change has been reverted back to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.

查看更多
登录 后发表回答