Typically in our work we use regular expressions in capture or match operations.
However, regular expressions can be used - manually at least - to generate legal sentences that match the regular expression. Of course, some regular expressions can match infinitely long sentences, e.g., the expression .+
.
I have a problem that could be solved by using a regular expression sentence generating algorithm.
In pseudocode, it would operate something like this:
re = generate("foo(bar|baz)?", max_match = 100); #Don't give me more than 100 results
assert re == ("foobar", "foobaz", "foo");
What algorithm would perform this for me?
Microsoft has a SMT-based gratis (MSRL-licensed) "Rex" tool for this: http://research.microsoft.com/en-us/downloads/7f1d87be-f6d9-495d-a699-f12599cea030/
From the Introduction section of the "Rex: Symbolic Regular Expression Explorer" paper:
We translate (extended) regular expressions or regexes [5] into a symbolic representation of finite automata called SFAs. In an SFA, moves are labeled by formulas representing sets of characters rather than individual characters. An SFA A is translated into a set of (recursive) axioms that describe the acceptance condition for the strings accepted by A and build on the representation of strings as lists.
As the SMT solver can output all possible solutions within some size bound, this may be close to what you're looking for.
On a more statistical and less formal front, the Regexp::Genex module from CPAN can work as well: http://search.cpan.org/dist/Regexp-Genex/
You can use it with something like this:
#!/usr/bin/env perl
use Regexp::Genex ':all';
my $hits = 100;
my $re = qr/[a-z](123|456)/;
local $Regexp::Genex::DEFAULT_LEN = length $re;
my %seen;
while ((time - $^T) < 2) {
@seen{strings($re)} = ();
$Regexp::Genex::DEFAULT_LEN++;
}
print "$_\n" for (sort %seen)[0..$hits-1];
Adjust the time and sample size as needed. Hope this helps!
Take a look at Xeger (Google Code).
The Visual Studio Team System appears to have an inverse regex generator, too, but it doesn't look like the algorithm is open source.