I'm using a simple language of only (), |, spaces, and alpha characters.
Given a regular expression like the following:
(hello|goodbye) (world(s|)|)
How would I go about generating the following data?
hello worlds
hello world
hello
goodbye worlds
goodbye world
goodbye
I'm not quite sure if I need to build a tree first, or if it can be done recursively. I'm stuck on what data structures to utilize, and how to generate the strings as I go. Will I have to keep a bunch of markers, and index back into partially built strings to concatenate more data on? I don't know how best to approach this problem. Would I need to read the whole expression first, and re-order it a certain way?
The function signature will look like this:
std::vector<std::string> Generate(std::string const&){
//...
}
What do you suggest I do?
EDIT:
Let me clarify that the results should always be finite here. In my particular example, only 6 strings can ever match the expression. I'm not sure my terminology is correct, but what I'm looking for is a full match of the expression, not any string that merely contains a matching substring.
Somewhat following Kieveli's advice, I have come up with a working solution. Although not previously mentioned, it was important for me to also get a count of how many results could potentially be generated. I was using a python script called "exrex" which I had found on github. Embarrassingly, I did not realize that it had the capability to also count. Nonetheless, I implemented it the best I could in C++ using my simplified regular expression language. If interested in my solution, please read on.
From an object-oriented standpoint, I wrote a scanner that takes the regular expression (a string) and converts it into a list of tokens (a vector of strings). The list of tokens is then sent to a parser, which generates an n-ary tree. All of this is packed inside an "expression generator" class that takes an expression and holds the parse tree as well as the generated count.
The scanner was important because it tokenized the empty string case which you can see in my question appearing as "|)". Scanning also created a pattern of [word] [operation] [word] [operation] ... [word].
For example, scanning: "(hello|goodbye) (world(s|)|)"
will create: [][(][hello][|][goodbye][)][ ][(][world][(][s][|][][)][][|][][)][]
The parse tree was a vector of nodes. Each node contains a vector of vectors of nodes.
In the diagram, the orange cells represent the "or"s, and the boxes that draw the connections represent the "and"s. Below is my code.
Node header
#pragma once
#include <string>
#include <vector>
class Function_Expression_Node{
public:
Function_Expression_Node(std::string const& value_in = "", bool const& more_in = false);
std::string value;
bool more;
std::vector<std::vector<Function_Expression_Node>> children;
};
Node source
#include "function_expression_node.hpp"
Function_Expression_Node::Function_Expression_Node(std::string const& value_in, bool const& more_in)
: value(value_in)
, more(more_in)
{}
Scanner header
#pragma once
#include <vector>
#include <string>
class Function_Expression_Scanner{
public: Function_Expression_Scanner() = delete;
public: static std::vector<std::string> Scan(std::string const& expression);
};
Scanner source
#include "function_expression_scanner.hpp"
#include <cctype> //for isalpha
std::vector<std::string> Function_Expression_Scanner::Scan(std::string const& expression){
std::vector<std::string> tokens;
std::string temp;
for (auto const& it: expression){
if (it == '('){
tokens.push_back(temp);
tokens.push_back("(");
temp.clear();
}
else if (it == '|'){
tokens.push_back(temp);
tokens.push_back("|");
temp.clear();
}
else if (it == ')'){
tokens.push_back(temp);
tokens.push_back(")");
temp.clear();
}
else if (isalpha(static_cast<unsigned char>(it)) || it == ' '){ //cast avoids undefined behavior for negative char values
temp+=it;
}
}
tokens.push_back(temp);
return tokens;
}
Parser header
#pragma once
#include <string>
#include <vector>
#include "function_expression_node.hpp"
class Function_Expression_Parser{
Function_Expression_Parser() = delete;
//get parse tree
public: static std::vector<std::vector<Function_Expression_Node>> Parse(std::vector<std::string> const& tokens, unsigned int & amount);
private: static std::vector<std::vector<Function_Expression_Node>> Build_Parse_Tree(std::vector<std::string>::const_iterator & it, std::vector<std::string>::const_iterator const& end, unsigned int & amount);
private: static Function_Expression_Node Recursive_Build(std::vector<std::string>::const_iterator & it, int & total); //<- recursive
//utility
private: static bool Is_Word(std::string const& it);
};
Parser source
#include "function_expression_parser.hpp"
bool Function_Expression_Parser::Is_Word(std::string const& it){
return (it != "(" && it != "|" && it != ")");
}
Function_Expression_Node Function_Expression_Parser::Recursive_Build(std::vector<std::string>::const_iterator & it, int & total){
Function_Expression_Node sub_root("",true); //<- contains the full root
std::vector<Function_Expression_Node> root;
const auto begin = it;
//calculate the amount
std::vector<std::vector<int>> multiplies;
std::vector<int> adds;
int sub_amount = 1;
while(*it != ")"){
//when we see a "WORD", add it.
if(Is_Word(*it)){
root.push_back(Function_Expression_Node(*it));
}
//when we see a "(", build the subtree,
else if (*it == "("){
++it;
root.push_back(Recursive_Build(it,sub_amount));
//adds.push_back(sub_amount);
//sub_amount = 1;
}
//else we see an "OR" and we do the split
else{
sub_root.children.push_back(root);
root.clear();
//store the sub amount
adds.push_back(sub_amount);
sub_amount = 1;
}
++it;
}
//add the last bit, if there is any
if (!root.empty()){
sub_root.children.push_back(root);
//store the sub amount
adds.push_back(sub_amount);
}
if (!adds.empty()){
multiplies.push_back(adds);
}
//calculate sub total
int or_count = 0;
for (auto const& it: multiplies){
for (auto const& it2: it){
or_count+=it2;
}
if (or_count > 0){
total*=or_count;
}
or_count = 0;
}
/*
std::cout << "---SUB FUNCTION---\n";
for (auto it: multiplies){for (auto it2: it){std::cout << "{" << it2 << "} ";}std::cout << "\n";}std::cout << "--\n";
std::cout << total << std::endl << '\n';
*/
return sub_root;
}
std::vector<std::vector<Function_Expression_Node>> Function_Expression_Parser::Build_Parse_Tree(std::vector<std::string>::const_iterator & it, std::vector<std::string>::const_iterator const& end, unsigned int & amount){
std::vector<std::vector<Function_Expression_Node>> full_root;
std::vector<Function_Expression_Node> root;
const auto begin = it;
//calculate the amount
std::vector<int> adds;
int sub_amount = 1;
int total = 0;
while (it != end){
//when we see a "WORD", add it.
if(Is_Word(*it)){
root.push_back(Function_Expression_Node(*it));
}
//when we see a "(", build the subtree,
else if (*it == "("){
++it;
root.push_back(Recursive_Build(it,sub_amount));
}
//else we see an "OR" and we do the split
else{
full_root.push_back(root);
root.clear();
//store the sub amount
adds.push_back(sub_amount);
sub_amount = 1;
}
++it;
}
//add the last bit, if there is any
if (!root.empty()){
full_root.push_back(root);
//store the sub amount
adds.push_back(sub_amount);
sub_amount = 1;
}
//calculate sub total
for (auto const& it: adds){ total+=it; }
/*
std::cout << "---ROOT FUNCTION---\n";
for (auto it: adds){std::cout << "[" << it << "] ";}std::cout << '\n';
std::cout << total << std::endl << '\n';
*/
amount = total;
return full_root;
}
std::vector<std::vector<Function_Expression_Node>> Function_Expression_Parser::Parse(std::vector<std::string> const& tokens, unsigned int & amount){
auto it = tokens.cbegin();
auto end = tokens.cend();
auto parse_tree = Build_Parse_Tree(it,end,amount);
return parse_tree;
}
Generator header
#pragma once
#include "function_expression_node.hpp"
class Function_Expression_Generator{
//constructors
public: Function_Expression_Generator(std::string const& expression);
public: Function_Expression_Generator();
//transformer
void Set_New_Expression(std::string const& expression);
//observers
public: unsigned int Get_Count();
//public: unsigned int Get_One_Word_Name_Count();
public: std::vector<std::string> Get_Generations();
private: std::vector<std::string> Generate(std::vector<std::vector<Function_Expression_Node>> const& parse_tree);
private: std::vector<std::string> Sub_Generate(std::vector<Function_Expression_Node> const& nodes);
private:
std::vector<std::vector<Function_Expression_Node>> m_parse_tree;
unsigned int amount;
};
Generator source
#include "function_expression_generator.hpp"
#include "function_expression_scanner.hpp"
#include "function_expression_parser.hpp"
//constructors
Function_Expression_Generator::Function_Expression_Generator(std::string const& expression){
auto tokens = Function_Expression_Scanner::Scan(expression);
m_parse_tree = Function_Expression_Parser::Parse(tokens,amount);
}
Function_Expression_Generator::Function_Expression_Generator() : amount(0) {} //no expression yet, so the count starts at zero
//transformer
void Function_Expression_Generator::Set_New_Expression(std::string const& expression){
auto tokens = Function_Expression_Scanner::Scan(expression);
m_parse_tree = Function_Expression_Parser::Parse(tokens,amount);
}
//observers
unsigned int Function_Expression_Generator::Get_Count(){
return amount;
}
std::vector<std::string> Function_Expression_Generator::Get_Generations(){
return Generate(m_parse_tree);
}
std::vector<std::string> Function_Expression_Generator::Generate(std::vector<std::vector<Function_Expression_Node>> const& parse_tree){
std::vector<std::string> results;
std::vector<std::string> more;
for (auto const& it: parse_tree){
more = Sub_Generate(it);
results.insert(results.end(), more.begin(), more.end());
}
return results;
}
std::vector<std::string> Function_Expression_Generator::Sub_Generate(std::vector<Function_Expression_Node> const& nodes){
std::vector<std::string> results;
std::vector<std::string> more;
std::vector<std::string> new_results;
results.push_back("");
for (auto const& it: nodes){
if (!it.more){
for (auto & result: results){
result+=it.value;
}
}
else{
more = Generate(it.children);
for (auto const& m: more){
for (auto const& r: results){
new_results.push_back(r+m);
}
}
more.clear();
results = new_results;
new_results.clear();
}
}
return results;
}
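For comparison, the scan, parse, and generate stages can also be collapsed into a single recursive-descent pass. This is only a minimal sketch, not the code above: ParseSequence and ParseAlternation are hypothetical helper names, the input is assumed to be well-formed (balanced parentheses), and literal spaces are kept, so the empty alternative in my example yields "hello " with a trailing space.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Strings = std::vector<std::string>;

Strings ParseAlternation(std::string const& s, std::size_t& pos);

// sequence := (character | '(' alternation ')')*
// Concatenation: every result so far is combined with every new part.
Strings ParseSequence(std::string const& s, std::size_t& pos) {
    Strings results{""};
    while (pos < s.size() && s[pos] != '|' && s[pos] != ')') {
        Strings parts;
        if (s[pos] == '(') {
            ++pos;                        // consume '('
            parts = ParseAlternation(s, pos);
            if (pos < s.size()) ++pos;    // consume ')'
        } else {
            parts.push_back(std::string(1, s[pos++]));
        }
        Strings next;
        for (auto const& r : results)
            for (auto const& p : parts)
                next.push_back(r + p);
        results = std::move(next);
    }
    return results;
}

// alternation := sequence ('|' sequence)*
// Alternatives are simply appended to one another.
Strings ParseAlternation(std::string const& s, std::size_t& pos) {
    Strings results = ParseSequence(s, pos);
    while (pos < s.size() && s[pos] == '|') {
        ++pos;                            // consume '|'
        Strings more = ParseSequence(s, pos);
        results.insert(results.end(), more.begin(), more.end());
    }
    return results;
}

std::vector<std::string> Generate(std::string const& expression) {
    std::size_t pos = 0;
    return ParseAlternation(expression, pos);
}
```

With this sketch the count is just the size of the returned vector. To count without generating, the same recursion could return numbers instead, adding across "|" and multiplying along a sequence.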
In conclusion, I recommend using exrex, or any other programs mentioned in this thread, if you need to generate matches for regular expressions.
When I did my own custom little language, I wrote a parser first. The parser created a structure in memory that represented the text. For this little language, I would create a structure that's something like this:
Node:
list of string values
isRequired
list of child Nodes
When you parse your text, you would get a list of nodes:
Node1:
hello, goodbye
true
[] (no child nodes)
Node2:
world,
false
[
Node3:
s,
false
[]
]
Once you parse into this structure, you can imagine code that generates what you want, given that you know what must be included and what may be included. The pseudocode would look like this:
recursiveGenerate( node_list, partial )
    if ( node_list is null or is empty )
        add partial to an output list
        return
    for the first node
        if ( ! node.isRequired )
            recursiveGenerate( remaining nodes, partial )
        for each value
            recursiveGenerate( child Nodes + remaining nodes, partial + value )
That should populate your list in the way you want.
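As a rough illustration, the pseudocode could be written in C++ along these lines. This is a sketch under assumptions: the Node layout mirrors the structure described above, generateExample is a hypothetical helper, and the space before "world" is folded into the value " world" for simplicity.

```cpp
#include <string>
#include <vector>

// Hypothetical node layout following the structure described above.
struct Node {
    std::vector<std::string> values;  // alternatives at this position
    bool isRequired;                  // false means the node may be skipped
    std::vector<Node> children;       // nodes that follow a chosen value
};

void recursiveGenerate(const std::vector<Node>& nodes,
                       const std::string& partial,
                       std::vector<std::string>& output) {
    if (nodes.empty()) {
        output.push_back(partial);    // nothing left to append
        return;
    }
    const Node& first = nodes.front();
    std::vector<Node> remaining(nodes.begin() + 1, nodes.end());

    // An optional node can be skipped, together with its children.
    if (!first.isRequired) {
        recursiveGenerate(remaining, partial, output);
    }

    // Otherwise pick each value, then continue with children + remaining.
    std::vector<Node> next = first.children;
    next.insert(next.end(), remaining.begin(), remaining.end());
    for (const std::string& value : first.values) {
        recursiveGenerate(next, partial + value, output);
    }
}

// Builds the node list from the example and expands it.
std::vector<std::string> generateExample() {
    Node plural{{"s"}, false, {}};
    Node world{{" world"}, false, {plural}};
    Node greeting{{"hello", "goodbye"}, true, {}};

    std::vector<std::string> output;
    recursiveGenerate({greeting, world}, "", output);
    return output;
}
```

Copying the remaining nodes on every call keeps the sketch close to the pseudocode; passing an index or iterators instead would avoid the copies.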
You might want to take a look at https://github.com/rhdunn/cainteoir-engine/blob/0c283e798c8141a65060c5e92f462646c2689644/tests/dictionary.py.
I wrote this to support regular expressions in text-to-speech pronunciation dictionaries, but the regex expanding logic is self-contained. You can use it like:
import dictionary
words, endings = dictionary.expand_expression('colou?r', {})
print words
Here, the second parameter is for references (i.e. named blocks), and endings is for constructs such as look{s,ed,ing}.
How it works ...
lex_expression splits the string into tokens delimited by the regex tokens []<>|(){}?. Thus, a(b|cd)efg becomes ['a', '(', 'b', '|', 'cd', ')', 'efg']. This makes it easier to parse the regex.
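Since the question is about C++, here is a rough sketch of the same tokenizing idea in that language. It is not a port of the Python code: lexExpression is a hypothetical name, and the delimiter set is taken from the description above.

```cpp
#include <string>
#include <vector>

// Splits an expression into literal runs and single-character regex tokens.
std::vector<std::string> lexExpression(const std::string& expression) {
    static const std::string delimiters = "[]<>|(){}?";
    std::vector<std::string> tokens;
    std::string literal;
    for (char c : expression) {
        if (delimiters.find(c) != std::string::npos) {
            if (!literal.empty()) {       // flush the pending literal first
                tokens.push_back(literal);
                literal.clear();
            }
            tokens.push_back(std::string(1, c));
        } else {
            literal += c;
        }
    }
    if (!literal.empty()) {
        tokens.push_back(literal);        // trailing literal, if any
    }
    return tokens;
}
```

Unlike the scanner in the question, this sketch does not emit empty tokens, so an empty alternative such as the one in (s|) would have to be detected by the parser.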
The parse_XYZ_expr functions (along with the top-level parse_expr) parse the regex elements, constructing an object hierarchy that represents the regex. These objects are:
- Literal -- a literal sequence of one or more characters
- Choice -- any of the sub-expressions in the sequence (i.e. '|')
- Optional -- either the result of the expression, or not (i.e. a?)
- Sequence -- the sub-expressions in order
Thus, ab(cd|e)? is represented as Sequence(Literal('ab'), Optional(Choice(Literal('cd'), Literal('e')))).
These classes support an expand method that has the form expr.expand(words) => expanded, e.g.:
expr = Optional('cd')
print expr.expand(['ab', 'ef'])
results in:
ab
abcd
ef
efcd
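For readers working in C++, the same hierarchy might be sketched as follows. The class names follow the description above, but the implementation is an assumption rather than a translation of dictionary.py, and expandExample is a hypothetical helper.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Words = std::vector<std::string>;

struct Expr {
    virtual ~Expr() = default;
    // Appends this expression's expansions to each of the given words.
    virtual Words expand(const Words& words) const = 0;
};
using ExprPtr = std::shared_ptr<const Expr>;

struct Literal : Expr {
    std::string text;
    explicit Literal(std::string t) : text(std::move(t)) {}
    Words expand(const Words& words) const override {
        Words out;
        for (const auto& w : words) out.push_back(w + text);
        return out;
    }
};

struct Choice : Expr {
    std::vector<ExprPtr> options;
    explicit Choice(std::vector<ExprPtr> o) : options(std::move(o)) {}
    Words expand(const Words& words) const override {
        Words out;
        for (const auto& option : options) {
            Words part = option->expand(words);
            out.insert(out.end(), part.begin(), part.end());
        }
        return out;
    }
};

struct Optional : Expr {
    ExprPtr inner;
    explicit Optional(ExprPtr e) : inner(std::move(e)) {}
    Words expand(const Words& words) const override {
        Words out;
        for (const auto& w : words) {
            out.push_back(w);                     // the "not present" case
            Words part = inner->expand(Words{w}); // the "present" case
            out.insert(out.end(), part.begin(), part.end());
        }
        return out;
    }
};

struct Sequence : Expr {
    std::vector<ExprPtr> items;
    explicit Sequence(std::vector<ExprPtr> i) : items(std::move(i)) {}
    Words expand(const Words& words) const override {
        Words current = words;
        for (const auto& item : items) current = item->expand(current);
        return current;
    }
};

// Builds ab(cd|e)? and expands it starting from the empty word.
Words expandExample() {
    auto choice = std::make_shared<Choice>(std::vector<ExprPtr>{
        std::make_shared<Literal>("cd"), std::make_shared<Literal>("e")});
    Sequence expr(std::vector<ExprPtr>{
        std::make_shared<Literal>("ab"), std::make_shared<Optional>(choice)});
    return expr.expand(Words{""});
}
```

In this sketch, Optional emits each word before its extended forms, which reproduces the ordering shown above.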
Let me repost one of my older answers:
I once wrote a little program that does this:
It works as follows:
1. All ? {} + * | () operators are expanded (to a maximal limit), so that only character classes and backreferences remain.
e.g. [a-c]+|t*|([x-z]){2}foo\1|(a|b)(t|u)
becomes [a-c]|[a-c][a-c]|[a-c][a-c][a-c]|[a-c][a-c][a-c][a-c]||t|tt|tt|ttt|ttt|([x-z][x-z])foo\1|at|au|bt|bu
(the | in the latter expression is just notation; the program keeps each alternative subregex in a list)
2. Backreferences to multiple characters are replaced by backreferences to single characters.
e.g. the expression above becomes [a-c]|[a-c][a-c]|[a-c][a-c][a-c]|[a-c][a-c][a-c][a-c]||t|tt|tt|ttt|ttt|([x-z])([x-z])foo\1\2|at|au|bt|bu
Now each alternative subregex matches a fixed-length string.
3. For each of the alternatives, all combinations of picking characters from the classes are printed:
e.g. the expression above becomes a|b|c|aa|ba|..|cc|aaa|baa|...|ccc|aaaa|...|cccc||t|tt|tt|ttt|ttt|xxfooxx|yxfooyx|...|zzfoozz|at|au|bt|bu
You can skip step 3 if you only want the count, which is often fast enough because the output of step 2 is usually far shorter than the final output.
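To illustrate the last step and the counting shortcut, here is a small C++ sketch. It is not the original program: it assumes each fixed-length alternative is already given as a list of character classes (each class written out as a string of its characters), it ignores backreferences, and the names countCombinations and expandCombinations are hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

// Number of strings one fixed-length alternative can produce:
// the product of the sizes of its character classes.
unsigned long long countCombinations(const std::vector<std::string>& classes) {
    unsigned long long count = 1;
    for (const auto& cls : classes) count *= cls.size();
    return count;
}

// Every way of picking one character from each class, in order.
// The first position varies fastest (aa, ba, ca, ab, ...).
std::vector<std::string> expandCombinations(const std::vector<std::string>& classes) {
    std::vector<std::string> results{""};
    for (const auto& cls : classes) {
        std::vector<std::string> next;
        for (char c : cls)
            for (const auto& prefix : results)
                next.push_back(prefix + c);
        results = std::move(next);
    }
    return results;
}
```

For the alternative [a-c][a-c], the classes would be {"abc", "abc"}, which gives a count of 9 without building any strings. The total for the whole expression is then the sum of the counts of its alternatives.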