I'm using a simple language of only (), |, spaces, and alpha characters.
Given a regular expression like the following:
(hello|goodbye) (world(s|)|)
How would I go about generating the following data?
hello worlds
hello world
hello
goodbye worlds
goodbye world
goodbye
I'm not quite sure if I need to build a tree first, or if it can be done recursively. I'm stuck on what data structures to utilize, and how to generate the strings as I go. Will I have to keep a bunch of markers, and index back into partially built strings to concatenate more data on? I don't know how best to approach this problem. Would I need to read the whole expression first, and re-order it a certain way?
The function signature is going to look like this:
std::vector<std::string> Generate(std::string const&){
//...
}
What do you suggest I do?
EDIT:
Let me clarify that the results should always be finite here. In my particular example, there are only 6 strings that would ever match the expression. I'm not sure if my terminology is correct here, but what I'm looking for is a full match of the expression, not any string that contains a substring which matches.
Somewhat following Kieveli's advice, I have come up with a working solution. Although not previously mentioned, it was also important for me to get a count of how many results could potentially be generated. I was using a Python script called "exrex" which I had found on GitHub. Embarrassingly, I did not realize that it could also count. Nonetheless, I implemented it as best I could in C++ using my simplified regular expression language. If you are interested in my solution, please read on.
From an object-oriented standpoint, I wrote a scanner to take the regular expression (a string) and convert it into a list of tokens (a vector of strings). The list of tokens was then sent to a parser which generated an n-ary tree. All of this was packed inside an "expression generator" class that could take an expression and hold the parse tree, as well as the generated count.
The scanner was important because it tokenized the empty-string case, which you can see in my question appearing as "|)". Scanning also created a pattern of [word] [operation] [word] [operation] ... [word].
For example, scanning:
"(hello|goodbye) (world(s|)|)"
will create:
[][(][hello][|][goodbye][)][ ][(][world][(][s][|][][)][][|][][)][]
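For illustration, a scanner along those lines can be sketched in C++ for this simplified language. The function name Scan and the details are mine, not the code linked further down; it treats (, ) and | as operators and runs of alpha characters and spaces as (possibly empty) words, which reproduces the token list above.

#include <string>
#include <vector>

// Sketch of the scanning step: emit a (possibly empty) word before every
// operator and at the very end, so the stream alternates
// [word] [operation] [word] [operation] ... [word].
std::vector<std::string> Scan(std::string const& expression)
{
    std::vector<std::string> tokens;
    std::string word;
    for (char c : expression)
    {
        if (c == '(' || c == ')' || c == '|')
        {
            tokens.push_back(word);              // word so far, may be ""
            tokens.push_back(std::string(1, c)); // the operator itself
            word.clear();
        }
        else
        {
            word += c; // alpha characters and spaces build up the current word
        }
    }
    tokens.push_back(word); // trailing word, may be ""
    return tokens;
}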
The parse tree was a vector of nodes, and each node contains a vector of vectors of nodes.
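As a sketch of one way that shape could look in C++ (the member names are mine, not the ones in the files below): the outer vector holds the "or" alternatives, and each inner vector holds the "and" sequence that makes up one alternative.

#include <string>
#include <vector>

// Sketch of the n-ary tree node: a node is either a literal word or a group.
struct Node
{
    std::string word;                            // set when the node is a literal
    std::vector<std::vector<Node>> alternatives; // outer = the "or"s, inner = an "and" sequence
};

// The parse tree itself is a sequence of nodes to concatenate.
using ParseTree = std::vector<Node>;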
The orange cells represent the "or"s, and the other boxes that draw the connections represent the "and"s. Below is my code:
Node header
Node source
Scanner header
Scanner source
Parser header
Parser source
Generator header
Generator source
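To give a feel for the generation and counting steps over such a tree, here is a sketch that assumes the Node shape above; it is not the contents of the files listed, and the function names are my own. Generation takes the cross product along an "and" sequence and the union across the "or" alternatives; counting follows the same recursion with a product and a sum, so it needs no strings at all.

#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> ExpandNode(Node const& node);

// Cross product along an "and" sequence: every partial string gets every
// expansion of the next node appended to it.
std::vector<std::string> ExpandSequence(std::vector<Node> const& sequence)
{
    std::vector<std::string> results{""};
    for (Node const& node : sequence)
    {
        std::vector<std::string> const expansions = ExpandNode(node);
        std::vector<std::string> next;
        for (std::string const& prefix : results)
            for (std::string const& suffix : expansions)
                next.push_back(prefix + suffix);
        results = std::move(next);
    }
    return results;
}

// Union across the "or" alternatives; a literal just yields its own word.
std::vector<std::string> ExpandNode(Node const& node)
{
    if (node.alternatives.empty())
        return {node.word};
    std::vector<std::string> results;
    for (std::vector<Node> const& alternative : node.alternatives)
        for (std::string const& s : ExpandSequence(alternative))
            results.push_back(s);
    return results;
}

// Counting mirrors the recursion: multiply along "and"s, add across "or"s.
std::size_t CountNode(Node const& node);

std::size_t CountSequence(std::vector<Node> const& sequence)
{
    std::size_t total = 1;
    for (Node const& node : sequence)
        total *= CountNode(node);
    return total;
}

std::size_t CountNode(Node const& node)
{
    if (node.alternatives.empty())
        return 1;
    std::size_t total = 0;
    for (std::vector<Node> const& alternative : node.alternatives)
        total += CountSequence(alternative);
    return total;
}

With something like this, the Generate function from the question reduces to scanning, parsing the tokens into a ParseTree, and calling ExpandSequence (or CountSequence for the count) on it.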
In conclusion, I recommend using exrex, or any of the other programs mentioned in this thread, if you need to generate matches for regular expressions.
When I did my own custom little language, I wrote a parser first. The parser created a structure in memory that represented the text. For this little language, I would create a structure that's something like this:
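As a sketch of what such a structure could look like in C++ for this simplified language (the type and field names here are my own, not Kieveli's):

#include <string>
#include <vector>

// Each parsed element is either literal text that must be included, or a
// group of alternatives of which exactly one is chosen (possibly an empty one).
struct ParsedNode
{
    enum class Kind { Text, Group } kind = Kind::Text;

    std::string text;                                  // Kind::Text: the literal characters
    std::vector<std::vector<ParsedNode>> alternatives; // Kind::Group: each branch is a node list
};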
When you parse your text, you would get a list of nodes. Once you parse into this structure, you can imagine code that'll generate what you want, given that you understand what must be included and what may be included. The pseudo code would look something like this:
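Written against the ParsedNode sketch above (again my own names, not the answer's original pseudo code), the generation pass could be: walk the node list left to right, keep every partial string built so far, append literals to all of them, and fork each partial string once per alternative when a group is reached.

#include <string>
#include <vector>

std::vector<std::string> GenerateAll(std::vector<ParsedNode> const& nodes)
{
    std::vector<std::string> partials{""}; // start from the empty string
    for (ParsedNode const& node : nodes)
    {
        std::vector<std::string> next;
        if (node.kind == ParsedNode::Kind::Text)
        {
            for (std::string const& partial : partials)
                next.push_back(partial + node.text); // must be included
        }
        else
        {
            for (std::vector<ParsedNode> const& branch : node.alternatives)
                for (std::string const& suffix : GenerateAll(branch)) // may be included
                    for (std::string const& partial : partials)
                        next.push_back(partial + suffix);
        }
        partials = std::move(next);
    }
    return partials;
}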
That should populate your list in the way you want.
Let me repost one of my older answers:
I once wrote a little program that does this.
It works as follows:
1. All ? {} + * | () operators are expanded (to a maximal limit), so that only character classes and backreferences remain.
e.g.
[a-c]+|t*|([x-z]){2}foo\1|(a|b)(t|u)
becomes
[a-c]|[a-c][a-c]|[a-c][a-c][a-c]|[a-c][a-c][a-c][a-c]||t|tt|tt|ttt|ttt|([x-z][x-z])foo\1|at|au|bt|bu
(the | in the latter expression is just notation; the program keeps each alternative subregex in a list)
2. Backreferences to multiple characters are replaced by backreferences to single characters.
e.g. the expression above becomes
[a-c]|[a-c][a-c]|[a-c][a-c][a-c]|[a-c][a-c][a-c][a-c]||t|tt|tt|ttt|ttt|([x-z])([x-z])foo\1\2|at|au|bt|bu
Now each alternative subregex matches a fixed-length string.
3. For each of the alternatives, all combinations of picking characters from the classes are printed:
e.g. the expression above becomes
a|b|c|aa|ba|..|cc|aaa|baa|...|ccc|aaaa|...|cccc||t|tt|tt|ttt|ttt|xxfooxx|yxfooyx|...|zzfoozz|at|au|bt|bu
You can skip step 3 if you only want the count (which is often fast enough, because the output of step 2 is usually far shorter than the final output).
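For the reduced language in the question (no character classes or backreferences), step 1 alone already yields the final strings, and it boils down to taking cross products of alternative lists. Here is a minimal sketch of that combining step (the function name is mine, not the program's):

#include <string>
#include <vector>

// Expanding a concatenation AB means every expansion of A followed by every
// expansion of B, e.g. {"a","b"} x {"t","u"} -> {"at","au","bt","bu"},
// which is how (a|b)(t|u) becomes at|au|bt|bu in step 1 above. An alternation
// A|B is simply the two lists appended. If only the count is wanted, the
// sizes multiply for concatenation and add for alternation, so no strings
// need to be built.
std::vector<std::string> Combine(std::vector<std::string> const& lhs,
                                 std::vector<std::string> const& rhs)
{
    std::vector<std::string> expanded;
    for (std::string const& a : lhs)
        for (std::string const& b : rhs)
            expanded.push_back(a + b);
    return expanded;
}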
You might want to take a look at https://github.com/rhdunn/cainteoir-engine/blob/0c283e798c8141a65060c5e92f462646c2689644/tests/dictionary.py.
I wrote this to support regular expressions in text-to-speech pronunciation dictionaries, but the regex-expanding logic is self-contained. When you call it, the second parameter is for references (i.e. named blocks), and the endings parameter is for things like:
look{s,ed,ing}
How it works ...
lex_expression splits the string into tokens delimited by the regex tokens []<>|(){}?. Thus, a(b|cd)efg becomes ['a', '(', 'b', '|', 'cd', ')', 'efg']. This makes it easier to parse the regex.
The parse_XYZ_expr functions (along with the top-level parse_expr) parse the regex elements, constructing an object hierarchy that represents the regex. These objects are Literal, Choice, Sequence and Optional (an optional expression such as a?).
Thus, ab(cd|e)? is represented as Sequence(Literal('ab'), Optional(Choice(Literal('cd'), Literal('e')))).
These classes support an expand method of the form expr.expand(words) => expanded.
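The actual classes are Python and live in the linked file; as a rough illustration of the shape of that object hierarchy and its expand step, here is a C++ sketch for the ab(cd|e)? example. The class names mirror the description above, but the code is mine, not the library's, and the words/references parameter of the real expand method is omitted for brevity.

#include <memory>
#include <string>
#include <vector>

// Rough C++ rendering of the hierarchy described above: Literal, Choice,
// Sequence and Optional, each able to expand into the strings it can match.
struct Expr
{
    virtual ~Expr() = default;
    virtual std::vector<std::string> expand() const = 0;
};

using ExprPtr = std::shared_ptr<Expr>;

struct Literal : Expr
{
    std::string text;
    explicit Literal(std::string t) : text(std::move(t)) {}
    std::vector<std::string> expand() const override { return {text}; }
};

struct Choice : Expr
{
    std::vector<ExprPtr> options; // exactly one option is matched
    explicit Choice(std::vector<ExprPtr> o) : options(std::move(o)) {}
    std::vector<std::string> expand() const override
    {
        std::vector<std::string> out;
        for (ExprPtr const& option : options)
            for (std::string const& s : option->expand())
                out.push_back(s);
        return out;
    }
};

struct Sequence : Expr
{
    std::vector<ExprPtr> parts; // matched one after another
    explicit Sequence(std::vector<ExprPtr> p) : parts(std::move(p)) {}
    std::vector<std::string> expand() const override
    {
        std::vector<std::string> out{""};
        for (ExprPtr const& part : parts)
        {
            std::vector<std::string> next;
            for (std::string const& prefix : out)
                for (std::string const& suffix : part->expand())
                    next.push_back(prefix + suffix);
            out = std::move(next);
        }
        return out;
    }
};

struct Optional : Expr
{
    ExprPtr inner; // matched zero or one time
    explicit Optional(ExprPtr e) : inner(std::move(e)) {}
    std::vector<std::string> expand() const override
    {
        std::vector<std::string> out{""}; // the "zero times" case
        for (std::string const& s : inner->expand())
            out.push_back(s);
        return out;
    }
};

// ab(cd|e)? corresponds to Sequence(Literal("ab"), Optional(Choice(Literal("cd"), Literal("e"))))
// and expands to "ab", "abcd" and "abe".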