I have a large collection of regular expression that when matched call a particular http handler. Some of the older regex's are unreachable (e.g. a.c* ⊃ abc*
) and I'd like to prune them.
Is there a library that given two regex's will tell me if the second is subset of the first?
I wasn't sure this was decidable at first (it smelled like the halting problem by a different name). But it turns out it's decidable.
If the regular expressions use "advanced features" of typical procedural matchers (like those in Perl, Java, Python, Ruby, etc.) that allow accepting languages that aren't regular, then you are out of luck. The problem is in general undecidable. E.g. the problem of whether one pushdown automaton recognizes the same context free (CF) language as another is undecidable. Extended regular expressions can describe CF languages.
On the other hand, if the regular expressions are "true" in the theoretical sense, consisting only of concatenation, alternation, and Kleene star over strings with a finite alphabet, plus the usual syntactic sugar on these (character classes, +, ?, etc), then there is a simple polynomial time algorithm.
I can't give you libraries, but this:
Converting a regex to a DFA generally uses Thompson's construction to get a non-deterministic automaton. This is converted to a DFA using the Subset Construction. The cross-product machine is another standard algorithm.
This was all worked out in the 1960's and is now part of any good undergrad computer science theory course. The gold standard for the topic is Hopcroft and Ullman, Automata Theory.
I found a python regex library that provides set operations.
http://github.com/ferno/greenery
The proof says
Sub ⊆ Sup ⇔ Sub ∩ ¬Sup is {}
. I can implement this with the python library:examples:
The library may not be robust because as mentioned in the other answers you need to use the minimal DFA in order for this to work. I'm not sure ferno's library makes (or can make) that guarantee.
As an aside: playing with the library to calculate inverse or simplify regexes is lots of fun.
a(b|.).*
simplifies toa.+
. Which is pretty minimal.The inverse of
abf*
is([^a]|a([^b]|bf*[^f])).*|a?
. Try to come up with that on your own!Trying to find the complexity of this problem lead me to this paper.
The formal definition of the problem can be found within: this is generally called the inclusion problem
That paper has some great information (summary: all but the simplest expressions are fairly complex), however searching for information on the inclusion problem leads one directly back to StackOverflow. That answer already had a link to a paper describing a passable polynomial time algorithm which should cover a lot of common cases.
There is an answer in the mathematics section: https://math.stackexchange.com/questions/283838/is-one-regular-language-subset-of-another.
Basic idea:
It would be benificial if you would just compute the new transition and the resulting states, omitting all non reachable states from the beginning.