Distinguishing identifiers from common strings

Posted 2019-09-14 02:33

Question:

I want to write a parser using Bison/Yacc + Lex which can parse statements like:

VARIABLE_ID = 'STRING' 

where:

ID       [a-zA-Z_][a-zA-Z0-9_]*

and:

STRING      [a-zA-Z0-9_]+

So, var1 = '123abc' is a valid statement while 1var = '123abc' isn't.

Therefore, a VARIABLE_ID is always a STRING, but a STRING is not always a VARIABLE_ID.

What I would like to know is whether the only way to distinguish between the two is to write a checking procedure at a higher level (i.e. inside the Bison code), or whether I can work it out in the Lex code.

Answer 1:

Your abstract statement syntax is actually:

VARIABLE = STRING

and not

VARIABLE = 'STRING'

because the quote delimiters are a lexical detail that we generally want to keep out of the syntax. And so, the token patterns are actually this:

ID       [a-zA-Z_][a-zA-Z0-9_]*
STRING   '[a-zA-Z_0-9]*'

An ID is a letter or underscore, followed by any combination (including empty) of letters, digits and underscores.

A STRING is a single quote, followed by a sequence (possibly empty) of letters, digits and underscores, followed by another single quote.

So the ambiguity you are concerned about does not exist. An ID is not in fact a STRING, nor vice versa.
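To make the disjointness concrete, here is a minimal hand-written sketch in plain C that mirrors the two patterns above. It is only an illustration, not what Lex generates: the names token_kind, is_word_char and classify are hypothetical, and a real scanner would also handle whitespace and the = sign. The point is that the first character alone decides between the two token types.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, named only for this illustration. */
enum token_kind { TOK_INVALID, TOK_ID, TOK_STRING };

/* Matches the character class [a-zA-Z0-9_]. */
static int is_word_char(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

/* Classifies a whole lexeme against the two patterns:
 *   ID      [a-zA-Z_][a-zA-Z0-9_]*
 *   STRING  '[a-zA-Z_0-9]*'
 * A STRING must start with a quote and an ID never can, so no input
 * matches both. */
enum token_kind classify(const char *s)
{
    size_t len = strlen(s);
    size_t i;

    if (len == 0)
        return TOK_INVALID;

    if (s[0] == '\'') {
        /* Needs at least the opening and the closing quote. */
        if (len < 2 || s[len - 1] != '\'')
            return TOK_INVALID;
        for (i = 1; i + 1 < len; i++)
            if (!is_word_char(s[i]))
                return TOK_INVALID;
        return TOK_STRING;
    }

    /* An ID may not begin with a digit. */
    if (!is_word_char(s[0]) || (s[0] >= '0' && s[0] <= '9'))
        return TOK_INVALID;
    for (i = 1; i < len; i++)
        if (!is_word_char(s[i]))
            return TOK_INVALID;
    return TOK_ID;
}
```

With this sketch, var1 classifies as an ID, '123abc' as a STRING, and 1var as neither, which matches the examples in the question.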

Somewhere inside your Bison parser, or possibly in the lexer, you might want to massage the yytext of a STRING match to remove the quotes and just retain the text in between them as a string. This could be a Bison rule, possibly similar to:

string : STRING 
       {
          $$ = strip_quotes($1);
       }
       ;
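
The rule above leaves strip_quotes undefined; it is not a Bison or Lex built-in. One possible sketch in C follows, assuming the lexer has already copied yytext into the token's semantic value as a NUL-terminated string, so that $1 is a char pointer. The return convention (a freshly allocated copy, NULL on malformed input) is an assumption made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of the text between
 * the surrounding single quotes of a STRING lexeme such as 'abc'.
 * Returns NULL if the lexeme is not wrapped in single quotes or if the
 * allocation fails. The caller owns the result and must free() it. */
char *strip_quotes(const char *lexeme)
{
    size_t len, inner;
    char *out;

    assert(lexeme != NULL);

    len = strlen(lexeme);
    if (len < 2 || lexeme[0] != '\'' || lexeme[len - 1] != '\'')
        return NULL;

    inner = len - 2;              /* length without the two quotes */
    out = malloc(inner + 1);      /* +1 for the terminating NUL */
    if (out == NULL)
        return NULL;

    memcpy(out, lexeme + 1, inner);
    out[inner] = '\0';
    return out;
}
```

Because the result is heap-allocated, whichever rule finally consumes the string value is responsible for freeing it. Doing the stripping in the lexer action instead would work equally well; the choice mostly depends on where you prefer to keep memory management.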