Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings?
It is probably unrealistic to find such a algorithm for grammars of regular expressions with very "complicated" syntax (including arbitrary repetitions, assertions etc.), so let's start with a simplified one which only allows for an OR
of substrings:
foo(a|b|cd)bar
should match fooabar
, foobbar
and foocdbar
.
Examples
Given the set of strings h_q1_a
, h_q1_b
, h_q1_c
, h_p2_a
, h_p2_b
, h_p2_c
, the desired output of the algorithm would be h_(q1|p2)_(a|b|c)
.
Given the set of strings h_q1_a
, h_q1_b
, h_p2_a
, the desired output of the algorithm would be h_(q1_(a|b)|p2_a)
. Note that h_(q1|p2)_(a|b)
would not be correct because that expand to 4 strings, including h_p2_b
, which was not in the original set of strings.
Use case
I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.
(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than the best. The trivial answer is of course to just concatenate all strings like A|B|C|D|... which is, however, not helpful.)
You can try to use Aho-Corasick algorithm to create a finite state machine from the input strings, after which it should be somewhat easy to generate the simplified regex. Your input strings as example:
will generate a finite machine that most probably look like this:
Now for every level/depth of the trie all the stings (if multiple) will go under
OR
brackets, soThere is a straight-forward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only concatenation and disjunction (
|
) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)The procedure is simple, if you have access to standard FSM algorithms:
Make a non-deterministic finite-state automaton (NFA) by just adding every string as a sequence of states, with each sequence starting from the start state. Clearly
O(N)
in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst-case for the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it directly into a DFA. The prefix tree cannot have more nodes than the original number of characters (and can have the same number of nodes if all the strings start with different characters), so its output is
O(N)
in size, but I don't have a proof off the top of my head that it is alsoO(N)
in time.Minimize the DFA.
DFA minimization is a well-studied problem. The Hopcroft algorithm is worst-case
O(NS log N)
algorithm, whereN
is the number of states in the DFA andS
is the size of the alphabet. Normally,S
would be considered a constant; in any event, the expected time of the Hopcroft algorithm is much better.For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.