I want to write a parser using Bison/Yacc + Lex which can parse statements like:
VARIABLE_ID = 'STRING'
where:
ID [a-zA-Z_][a-zA-Z0-9_]*
and:
STRING [a-zA-Z0-9_]+
So, var1 = '123abc' is a valid statement while 1var = '123abc' isn't.
Therefore, a VARIABLE_ID is a STRING, but a STRING is not always a VARIABLE_ID.
What I would like to know is whether the only way to distinguish between the two is to write a checking procedure at a higher level (i.e. inside the Bison code), or whether I can work it out in the Lex code.
Your abstract statement syntax is actually:
VARIABLE_ID = STRING
and not
VARIABLE_ID = 'STRING'
because the quote delimiters are a lexical detail that we generally want to keep out of the syntax. And so, the token patterns are actually this:
ID [a-zA-Z_][a-zA-Z0-9_]*
STRING '[a-zA-Z0-9_]*'
An ID is a letter or underscore, followed by any combination (including empty) of letters, digits and underscores.
A STRING is a single quote, followed by a sequence (possibly empty) of letters, digits and underscores, followed by another single quote.
So the ambiguity you are concerned about does not exist. An ID is not in fact a STRING, nor vice versa, since a STRING always starts with a quote and an ID never does.
Somewhere inside your Bison parser, or possibly in the lexer, you might want to massage the yytext of a STRING match to remove the quotes and just retain the text in between them as a string. This could be a Bison rule, possibly similar to:
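A minimal sketch of the quote-stripping step, assuming a hypothetical C helper named strip_quotes that a Flex action or a Bison rule action could call. The helper name, the str member of the %union, and the pattern shown in the trailing comment are illustrative assumptions, not something Flex or Bison provides:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * strip_quotes is a hypothetical helper, not part of Flex or Bison.
 * Given the matched text of a STRING token such as "'123abc'", it returns
 * a newly allocated copy of the text between the quotes ("123abc").
 * Returns NULL if allocation fails. The caller must free() the result.
 */
char *strip_quotes(const char *quoted)
{
    size_t len;
    char *inner;

    assert(quoted != NULL);
    len = strlen(quoted);
    /* The STRING pattern guarantees a leading and a trailing quote. */
    assert(len >= 2 && quoted[0] == '\'' && quoted[len - 1] == '\'');

    inner = malloc(len - 1);          /* len - 2 characters plus the NUL */
    if (inner == NULL)
        return NULL;
    memcpy(inner, quoted + 1, len - 2);
    inner[len - 2] = '\0';
    return inner;
}

/*
 * Possible use in a Flex action (assumes the Bison %union has a
 * hypothetical `char *str` member):
 *
 *   '[a-zA-Z0-9_]*'   { yylval.str = strip_quotes(yytext); return STRING; }
 */
```

Doing the copy in the lexer action is probably the safer place, because yytext is only valid until the next token is scanned, and the parser may already have read a lookahead token by the time a rule action runs.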