Is it worth using Python's re.compile?

Posted 2018-12-31 21:15

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

Tags: python regex
22 answers
人气声优
#2 · 2018-12-31 21:46

I agree with Honest Abe that the two match(...) calls in the given examples are different. They are not a one-to-one comparison, so the outcomes vary. To simplify my reply, I'll label the functions in question A, B, C and D. Yes, we are dealing with four functions in re.py rather than three.

Running this piece of code:

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)

is the same as running this code:

re.match('hello', 'hello world')          # (C)

Because, looking into the source of re.py, (A + B) means:

h = re._compile('hello')                  # (D)
h.match('hello world')

and (C) is actually:

re._compile('hello').match('hello world')

So (C) is not the same as (B). In fact, (C) calls (B) after calling (D), which is also what (A) calls. In other words, (C) = (A) + (B). Therefore, running (A + B) inside a loop gives the same result as running (C) inside a loop.
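
One way to check the claim that (C) and (A) + (B) end up doing the same work is to compare their results directly. This is only a minimal sketch; the identity check at the end relies on the module's internal cache, which is a CPython implementation detail and not documented behavior.

```python
import re

# (A) + (B): compile explicitly, then match
h = re.compile('hello')
m_ab = h.match('hello world')

# (C): module-level convenience function
m_c = re.match('hello', 'hello world')

# Both produce a match over the same span of the input
print(m_ab.span(), m_c.span())

# The match from (C) carries the compiled pattern it used. On CPython this
# is typically the very object returned by (A), served from the cache
# (an implementation detail, not a guarantee).
print(m_c.re is h)
```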

George's regexTest.py proved this for us.

noncompiled took 4.555 seconds.           # (C) in a loop
compiledInLoop took 4.620 seconds.        # (A + B) in a loop
compiled took 2.323 seconds.              # (A) once + (B) in a loop
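
George's script is not reproduced in this answer. A rough sketch of a comparable benchmark might look like the following; the function names mirror the labels in the output above, while PATTERN, TEXT, N and run_benchmark are assumptions made for illustration. Absolute timings will differ by machine and Python version.

```python
import re
import timeit

PATTERN = 'hello'
TEXT = 'hello world'
N = 100_000  # hypothetical loop count, not the one used in the quoted run

def noncompiled():
    # (C) in a loop
    for _ in range(N):
        re.match(PATTERN, TEXT)

def compiledInLoop():
    # (A + B) in a loop
    for _ in range(N):
        re.compile(PATTERN).match(TEXT)

def compiled():
    # (A) once + (B) in a loop
    h = re.compile(PATTERN)
    for _ in range(N):
        h.match(TEXT)

def run_benchmark():
    # Returns {function name: seconds for one full run of its loop}
    return {f.__name__: timeit.timeit(f, number=1)
            for f in (noncompiled, compiledInLoop, compiled)}

if __name__ == '__main__':
    for name, seconds in run_benchmark().items():
        print('%s took %.3f seconds.' % (name, seconds))
```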

What everyone wants to know is how to get the 2.323-second result. To make sure compile(...) is called only once, we need to keep the compiled regex object in memory. If we are using a class, we can store the object as a class attribute and reuse it every time our function is called.

import re

class Foo:
    regex = re.compile('hello')  # compiled once, when the class is defined

    def my_function(self, text):
        return self.regex.match(text)

If we are not using a class (which is my requirement today), then I have no comment. I'm still learning to use global variables in Python, and I know global variables are considered a bad thing.
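
Without a class, one common option is a module-level constant: the pattern is compiled once when the module is imported, and functions simply read it. Reading a module-level name does not need a global statement. A minimal sketch, where HELLO_RE and starts_with_hello are hypothetical names:

```python
import re

# Compiled once, at import time
HELLO_RE = re.compile('hello')

def starts_with_hello(text):
    # match() only succeeds at the start of the string
    return HELLO_RE.match(text) is not None

print(starts_with_hello('hello world'))  # True
print(starts_with_hello('say hello'))    # False
```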

One more point: I believe the (A) + (B) approach has the upper hand. Here are some facts as I observed them (please correct me if I'm wrong):

  1. Calling (A) once does one lookup in the _cache followed by one sre_compile.compile() to create the regex object. Calling (A) twice does two lookups but only one compile (because the regex object is cached).

  2. If the _cache gets flushed in between, the regex object is released from memory and Python needs to compile it again. (Someone suggested that Python won't recompile.)

  3. If we keep the regex object returned by (A), it will still go into the _cache and may eventually get flushed. But our code keeps a reference to it, so the regex object is not released from memory. Thus, Python does not need to compile it again.

  4. The 2-second difference in George's test between compiledInLoop and compiled is mainly the time required to build the key and search the _cache. It is not the compile time of the regex.

  5. George's reallycompile test shows what happens if the compile is really redone every time: it is 100x slower (he reduced the loop from 1,000,000 to 10,000).
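
Points 2 and 3 can be tried out with re.purge(), which clears the module's regex cache. This is a small sketch, not a statement about how the cache is managed internally; whether a later compile returns a new object is CPython behavior and should be treated as an assumption.

```python
import re

held = re.compile('hello')   # (A): our own reference to the regex object

re.purge()                   # clear the module's regex cache

# Our reference survives the flush and still works
print(held.match('hello world') is not None)

# A fresh compile after the flush; on CPython this builds a new object,
# so the identity check is expected to differ from the pre-flush case
again = re.compile('hello')
print(again is held)
```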

Here are the only cases where (A + B) is better than (C):

  1. If we can cache a reference to the regex object inside a class.
  2. If we need to call (B) repeatedly (inside a loop or multiple times), we must cache the reference to the regex object outside the loop.

Cases where (C) is good enough:

  1. We cannot cache a reference.
  2. We only use it once in a while.
  3. Overall, we don't have too many regexes (assuming the compiled ones never get flushed).

Just to recap, here are A, B and C:

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)
re.match('hello', 'hello world')          # (C)

Thanks for reading.

泪湿衣
#3 · 2018-12-31 21:51

FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

I also added a case (notice the quotation mark differences between the last two runs) that shows that re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...): both pay the same per-call overhead, which is the re module's internal cache lookup rather than a full recompilation, so the saving comes from holding on to the compiled object yourself.

浪荡孟婆
#4 · 2018-12-31 21:54

This is a good question. You often see people use re.compile without reason. It lessens readability. But sure, there are lots of times when pre-compiling the expression is called for, like when you use it repeatedly in a loop or some such.

It's like everything about programming (everything in life actually). Apply common sense.

萌妹纸的霸气范
#5 · 2018-12-31 21:55

Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

import re
import time

rgx = re.compile(r'(\w+)\s+[0-9_]?\s+\w*')
text = "average    2 never"  # sample input
a = 0

t = time.time()

for i in xrange(1000000):  # Python 2; use range() on Python 3
    if re.match(r'(\w+)\s+[0-9_]?\s+\w*', text):
    #~ if rgx.match(text):
        a += 1

print time.time() - t  # Python 2 print statement

Running the above code once as is, and once with the two if lines commented the other way around, shows the compiled regex to be twice as fast.

初与友歌
#6 · 2018-12-31 21:55

Performance difference aside, using re.compile and then using the compiled regular expression object to match (or to do any other regex-related operation) makes the intent clearer to the Python runtime.

I had a painful experience debugging some simple code:

compare = lambda s, p: re.match(p, s)

and later I'd use compare in

[x for x in data if compare(patternPhrases, x[columnIndex])]

where patternPhrases is supposed to be a variable containing a regular expression string, and x[columnIndex] is a variable containing a string.

My trouble was that patternPhrases did not match some expected strings!

But if I used the re.compile form:

compare = lambda s, p: p.match(s)

then in

[x for x in data if compare(patternPhrases, x[columnIndex])]

Python would have complained that a string has no attribute match, because the positional argument mapping in compare means x[columnIndex] is used as the regular expression, when I actually meant

compare = lambda p, s: p.match(s)

In my case, using re.compile makes the purpose of the regular expression more explicit when its value is hidden from the naked eye, so I could get more help from Python's run-time checking.

So the moral of my lesson is that when the regular expression is not just a literal string, I should use re.compile and let Python help me assert my assumption.
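
The difference between the two failure modes can be sketched as follows. The pattern and the sample string are hypothetical stand-ins for patternPhrases and x[columnIndex]; the point is only that the compiled form fails loudly while the string form fails silently.

```python
import re

pattern = re.compile(r'\d+')       # hypothetical stand-in for patternPhrases
compare = lambda s, p: p.match(s)  # parameters in the wrong order

try:
    # pattern lands in s, the data string lands in p
    compare(pattern, '123 abc')
    outcome = 'no error'
except AttributeError as exc:
    outcome = 'AttributeError'
    print(exc)

# With the string-based version the same mistake passes silently:
# the data string is used as the pattern, and nothing matches
silent = (lambda s, p: re.match(p, s))(r'\d+', '123 abc')
print(silent)  # None
```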

唯独是你
#7 · 2018-12-31 21:58

Using the given examples:

h = re.compile('hello')
h.match('hello world')

The match method in the example above is not the same as the one used below:

re.match('hello', 'hello world')

re.compile() returns a regular expression object, which means h is a regex object.

The regex object has its own match method with the optional pos and endpos parameters:

regex.match(string[, pos[, endpos]])

pos

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

endpos

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

The regex object's search, findall, and finditer methods also support these parameters.

re.match(pattern, string, flags=0) does not support them, as you can see, nor do its search, findall, and finditer counterparts.
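
A short sketch of pos and endpos on a compiled pattern, using the question's sample string; the variable names are made up for illustration.

```python
import re

text = 'hello world'
rx = re.compile('world')

print(re.match('world', text))   # None: match() starts at index 0
m = rx.match(text, 6)            # start matching at index 6
print(m.span())                  # (6, 11)

# '^' still refers to the real start of the string, not to pos
caret = re.compile('^world').match(text, 6)
print(caret)                     # None

# endpos makes the string behave as if it were endpos characters long
short = rx.search(text, 0, 8)
print(short)                     # None: 'world' does not fit in text[:8]
print(rx.search(text[:8], 0))    # None as well
```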

A match object has attributes that complement these parameters:

match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.


A regex object has two unique, possibly useful, attributes:

regex.groups

The number of capturing groups in the pattern.

regex.groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.


And finally, a match object has this attribute:

match.re

The regular expression object whose match() or search() method produced this match instance.
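
The attributes listed above can be inspected like this. The pattern and input are hypothetical examples chosen only to exercise each attribute.

```python
import re

rx = re.compile(r'(?P<word>\w+)\s+(\d+)')  # hypothetical pattern

print(rx.groups)             # 2
print(dict(rx.groupindex))   # {'word': 1}

# Limit the match to the first 9 characters: 'average 2'
m = rx.match('average 2 never', 0, 9)
print(m.group('word'), m.group(2))
print(m.pos, m.endpos)       # 0 9
print(m.re is rx)            # True
```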
