I need to match certain URLs in web application, i.e. /123,456,789
, and wrote this regex to match the pattern:
r'(\d+(,)?)+/$'
I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')
The expected result would be that there were no matches.
This expression, however, executes almost immediately (note the trailing slash):
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')
Is this a bug?
First off, I must say that it is not a BUG. its because of that , it tries all the possibilities, it takes time and computing resources. Sometimes it can gobble up a lot of time. When it gets really bad, it’s called catastrophic backtracking.
This is the code of
findall
function in python source code:As you see it just use the compile() function, so based on
_compile()
function that actually useTraditional NFA
that python use for its regex matching, and base on this short explain about backtracking in regular expression in Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!Let's go inside your pattern: So you have
r'(\d+(,)?)+/$'
with this string'12345121,223456,123123,3234,4523,523523'
we have this steps:12345121
) is matched with\d+
, then,
is matched with(,)?
.+
after the grouping ((\d+(,)?)+
)/$
to be matched. Therefore,(\d+(,)?)+
needs to "backtrack" to one character before the last for check for/$
. Again, it don't find any proper match, so after that it's (,
)'s turn to backtrack, then\d+
will backtrack, and this backtracking will be continue to end till it returnNone
. So based on the length of your string it takes time, which in this case is very high, and it create a nested quantifiers entirely!As an approximately benchmarking, in this case, you have 39 character so you need 3^39 backtracking attempts (we have 3 methods for backtrack).
Now for better understanding, I measure the runtime of the program while changing the length of the string:
So to avoid this problem you can use one of the below ways:
To avoid the catastrophic backtracking I suggest
There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.
The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string:
[\d,]+/$
. However, that is not perfect since it would allow for something like,123,,4,5/
.For this you can use a slightly optimized version of your initial try:
(?:\d,?)+/$
. First, I made your repeating group non-capturing ((?:...)
) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the\d
inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional,
since?
only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing/
.This can still match an odd string
1,2,3,/
, so for the heck of it I improved your original regex with a negative lookbehind:(?:\d,?)+(?<!,)/$
. This will assert that there is no comma directly before the trailing/
.