I want to write a parser using Bison/Yacc + Lex which can parse statements like:
VARIABLE_ID = 'STRING'
where:
ID [a-zA-Z_][a-zA-Z0-9_]*
and:
STRING [a-zA-Z0-9_]+
So, var1 = '123abc' is a valid statement while 1var = '123abc' isn't.
Therefore, every VARIABLE_ID is a STRING, but a STRING is not always a VARIABLE_ID.
What I would like to know is whether the only way to distinguish between the two is to write a checking procedure at a higher level (i.e. inside the Bison code), or whether I can work it out in the Lex code.
Your abstract statement syntax is actually:
VARIABLE = STRING
and not
VARIABLE = 'STRING'
because the quote delimiters are a lexical detail that we generally want to keep out of the syntax. So the token patterns are actually these:
ID [a-zA-Z_][a-zA-Z0-9_]*
STRING '[a-zA-Z_0-9]*'
An ID is a letter or underscore, followed by any combination (including empty) of letters, digits and underscores.
A STRING is a single quote, followed by a sequence (possibly empty) of letters, digits and underscores, followed by another single quote.
So the ambiguity you are concerned about does not exist. An ID is not in fact a STRING, nor vice versa.
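To make the point concrete, here is a small sketch in C that checks whole lexemes against the two patterns using the POSIX <regex.h> functions. The helper names matches, is_id and is_string are made up for illustration, and the ^...$ anchors are added only so that the whole input is tested; a Lex scanner would apply the patterns itself, without any of this code.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` matches the POSIX extended regex `pattern`,
 * 0 otherwise (a pattern that fails to compile also yields 0). */
static int matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

/* ID: a letter or underscore, then any letters, digits, underscores. */
int is_id(const char *text)
{
    return matches("^[a-zA-Z_][a-zA-Z0-9_]*$", text);
}

/* STRING: the same character set, but enclosed in single quotes. */
int is_string(const char *text)
{
    return matches("^'[a-zA-Z_0-9]*'$", text);
}
```

With these helpers, var1 is accepted only by is_id, '123abc' only by is_string, and 1var by neither, so no lexeme can satisfy both patterns: one must start with a quote and the other cannot.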
Somewhere inside your Bison parser, or possibly in the lexer, you might want to massage the yytext of a STRING match to remove the quotes and just retain the text in between them as a string. This could be a Bison rule, possibly similar to:
string : STRING
         {
           $$ = strip_quotes($1);
         }
       ;
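The strip_quotes function is not something Bison or Lex provides; you would write it yourself. One possible sketch is below. It assumes the semantic value of STRING is a char * holding a copy of yytext (for example, set in the lexer with strdup), and that returning a freshly allocated string is acceptable in your grammar.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of `quoted`
 * without its surrounding single quotes. Returns NULL if the input
 * is NULL, is not wrapped in single quotes, or if allocation fails.
 * The caller owns the result and must free() it. */
char *strip_quotes(const char *quoted)
{
    if (quoted == NULL)
        return NULL;

    size_t len = strlen(quoted);
    if (len < 2 || quoted[0] != '\'' || quoted[len - 1] != '\'')
        return NULL;

    size_t inner = len - 2;          /* length without the two quotes */
    char *out = malloc(inner + 1);   /* +1 for the terminating NUL */
    if (out == NULL)
        return NULL;

    memcpy(out, quoted + 1, inner);
    out[inner] = '\0';
    return out;
}
```

Because the result is a separate allocation, the action would also need to free $1 if the lexer allocated it, otherwise the original quoted text leaks.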