Preface
I'm trying to write my own context-free-grammar specification, to associate with the rules of my lexer/parser. It is meant to be similar to that of ANTLR's, where upper-case identifiers classify as a Lexer rule and lower-case identifiers classify as a Parser rule. It is meant to accept any combination of string literals and/or regular expressions for lexer rules, and any combination of lexer/regex rules and/or other parser identifiers for parser rules. Each rule in is the format of <identifier>:<expression>;
Here's an example of the grammar:
integer : DIGIT+; //parser rule containing at least one lexer rule
twodigits : DIGIT DIGIT; //parser rule containing two consecutive lexer rules
DIGIT : [0-9]; //lexer rule containing regex
string : '"' CHAR* '"'; //parser rule containing zero or more
// lexer rules, wrapped in two string literals
CHAR : (LCHAR|UCHAR); //lexer rule containing two lexer rules which
// will later evaluate to one of two tokens
LCHAR : [a-z]; //lexer rule containing regex
UCHAR : [A-Z]; //lexer rule containing regex
SPACE : ' '; //lexer rule containing string literal
Problem
The trouble I'm having is the ability to match the expression strings, since their contents tend to vary.
I have originally written:
([a-zA-Z0-9_]*)(?:\s*)(?:\:)(?:\s*)((?:\'?).*(?:\'?)(?:\;))
as the match rule, which does okay for a single string literal expression surrounded by single quotes, but I need to expand this to allow for multiple non-greedy string literals, and combined statements separated by any number of whitespace. I am not concerned with matching potential regex's within a matched expression, or even capturing segregated parts of the expression, as this is handled later on by a separate regex operation, so really I just need to validate identifiers and expressions...
All in all, I need the regex_search operation to look through the grammar's contents, using the following syntax for matches:
- A valid identifier, starting with one or more lower or uppercase letters, optionally followed by any number of alphanumeric characters (which optionally can contain any number of underscore characters inbetween, as long as the identifier does not start or end with one).
- Any number of whitespace characters, tabs, newlines etc, without capturing it.
- A colon without capturing it.
- Any number of whitespace characters, tabs, newlines etc, without capturing it.
- At least one of: (in any order) any number of string literals (enclosed in single quotes, without capturing the quotes), any number of lexer/parser identifiers, any number of regex's (enclosed in square brackets). The result of this match rule should capture the entire expression as a single string, which will later go through a post-processing stage.
- Any number of whitespace characters, tabs, newlines etc, without capturing it.
- A semicolon optionally followed by any uncaptured whitespace.
- Optionally, any number of uncaptured spaces followed by a single captured line comment
- Any number of whitespace characters, tabs, newlines etc, without capturing it.
Question
Is it possible to place this into a single regex_search operation?
I've messed around in Expresso and just can't seem to get it right...
Update
So far, I've been able to come up with the following:
#/////////////////////
# Identifier
#/////////////////////
(
(?:[a-zA-Z]+) # At least one lower/uppercase letter
(?:
(?:[a-zA-Z0-9_]*) # Zero or more alphanumeric/underscore characters,
(?:\w+) # explicitly followed by one or more alphanumeric
)? # characters
)
#/////////////////////
# Separator
#/////////////////////
(?:\s*) # Any amount of uncaptured whitespace
(?:\:) # An uncaptured colon
(?:\s*) # Any amount of uncaptured whitespace
#///////////////////////
# Expression
#///////////////////////
(
# String Literals:
(?:\'?) # An optional single quote,
(?: # which is meant to start and end a string
(?:[^'\\] | \\.)* # literal, but issues several problems for
) # me (see comments below, after this code block)
(?:\'?)
# Other expressions
# ????????????
)
#/////////////////////
# Line End
#/////////////////////
(?:\s*) # Any amount of uncaptured whitespace
(?:\;) # An uncaptured colon
(?:\s*) # Any amount of uncaptured whitespace
As you can see, I have identifiers, separators and line-ends working perfectly. But expressions are where I'm totally stuck!
How can I tell the regex library that I want EITHER a non-greedy string literal, OR any set of characters before the Line End, AND any number of them in any order?
Even if I only allowed a single string literal, how would I say "The closing single quote is NOT optional if the first one exists"?
It might not be flawless, and may require additional coding in how match results are handled, but this appears to work: