When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.
But then I thought I'd test it, comparing two scenarios:
- a single compilation of a simple regex, then 10 applications of that compiled regex.
- 10 applications of an uncompiled regex (where I would have expected slightly worse performance because the regex would have to be compiled once, then cached, and then looked up in the cache 9 times).
However, the results were staggering (in Python 3.3):
>>> import timeit
>>> timeit.timeit(setup="import re",
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search(" jkdhf ")')
18.547793477671938
>>> timeit.timeit(setup="import re",
... stmt='for i in range(10):\n re.search(r"\w+"," jkdhf ")')
106.47892003890324
That's over 5.7 times slower! In Python 2.7, there is still a slowdown by a factor of 2.5, which is also more than I would have expected.
Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.
The code has changed.
In Python 2.7, the cache is a simple dictionary; if more than _MAXCACHE items are stored in it, the whole cache is cleared before a new item is stored. A cache lookup only requires building a simple key and testing the dictionary; see the 2.7 implementation of _compile().
In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work: it takes a thread lock, adjusts the cache's LRU queue, and maintains cache statistics (accessible via re._compile.cache_info()). See the 3.3 implementation of _compile() and of functools.lru_cache().
Others have noticed the same slowdown and filed issue 16389 in the Python bug tracker. I'd expect 3.4 to be a lot faster again; either the lru_cache implementation will be improved or the re module will move back to a custom cache.
Update: With revision 4b4ffffdd670d0 the cache change has been reverted to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.
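For comparison, the lru_cache-based approach described above can be approximated as follows. This is only a sketch: lru_compile is a hypothetical wrapper written for this example, not the function inside the re module, and it merely shows the decorator and the statistics it maintains.

```python
import functools
import re


@functools.lru_cache(maxsize=500, typed=True)
def lru_compile(pattern, flags=0):
    """Hypothetical wrapper: compile a regex behind an LRU cache."""
    return re.compile(pattern, flags)


# Ten applications of the same pattern: one miss, then nine hits.
for _ in range(10):
    lru_compile(r"\w+").search(" jkdhf ")

info = lru_compile.cache_info()
print(info.hits, info.misses, info.currsize)  # 9 1 1
```

Every one of those hits goes through the decorator's bookkeeping (locking, reordering the LRU queue, updating the counters), which is the extra per-call work that a plain dictionary lookup avoids.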