ANTLR chaining 1 to 1 grammar rules together to so

2019-05-15 09:05发布

问题:

If you look at the ObjectiveC antlr v3 grammars (http://www.antlr3.org/grammar/1212699960054/ObjectiveC2ansi.g), and many of the other popular grammars out there they do a similar structure to this for solving conditionals

conditional_expression : logical_or_expression 
  ('?' logical_or_expression ':' logical_or_expression)? ;

constant_expression : conditional_expression ;

logical_or_expression : logical_and_expression 
  ('||' logical_and_expression)* ;

logical_and_expression : inclusive_or_expression 
  ('&&' inclusive_or_expression)* ;

inclusive_or_expression : exclusive_or_expression 
  ('|' exclusive_or_expression)* ;

exclusive_or_expression : and_expression ('^' and_expression)* ;

and_expression : equality_expression ('&' equality_expression)* ;

equality_expression : relational_expression 
  (('!=' | '==') relational_expression)* ;

relational_expression : shift_expression
 (('<' | '>' | '<=' | '>=') shift_expression)* ;

shift_expression : additive_expression (('<<' | '>>') additive_expression)* ;

additive_expression : multiplicative_expression
  (('+' | '-') multiplicative_expression)* ;

multiplicative_expression : cast_expression 
  (('*' | '/' | '%') cast_expression)* ;

cast_expression : '(' type_name ')' cast_expression | unary_expression ;

unary_expression 
  : postfix_expression
  | '++' unary_expression
  | '--' unary_expression
  | unary_operator cast_expression
  | 'sizeof' ('(' type_name ')' | unary_expression) ;

unary_operator : '&' | '*' | '-' | '~' | '!' ;

If you read it you'll notice they do this very long 1 to 1 chain of conditionals from conditional_expression to logical_or_expression to logical_and_expression to inclusive_or_expression to exclusive_or_expression.

Now, I am quite naive when it comes to ANTLR but this strikes me as an odd way to parse conditionals. It seems very complicated for the definition of a logical_or_expression to twist through every single other conditional expression type. Afterall, what does the definition of a logical OR have to do with a left bitwise shift?

Is there perhaps a better way, or is there a specific reason this method is required?

回答1:

As already mentioned, the "chain" is needed to properly handle operator precedence. Without it, input like 1+2*3 would be parsed as:

     *
    / \
   +   3
  / \
 1   2

instead of:

  +
 / \
1   *
   / \
  2   3

Since ANTLR 4 supports direct left recursive rules:

foo
 : foo '?' foo
 | TOKEN
 ;

so not indirect left recursive rules:

foo
 : bar
 | TOKEN
 ;

bar
 : foo '?' foo
 ;

You can rewrite these rules as follows:

expression
 : '-' expression
 | '(' type_name ')' expression
 | expression ('*' | '/' | '%') expression
 | expression ('+' | '-') expression
 | expression ('<<' | '>>') expression
 | expression ('<' | '>' | '<=' | '>=') expression
 | expression ('!=' | '==') expression
 | expression '&' expression
 | expression '^' expression
 | expression '|' expression
 | expression '&&' expression
 | expression '||' expression
 | expression '?' expression ':' expression
 | IDENTIFIER
 | NUMBER
 ;

If the parser now stumbles upon an expression, it will first look for ('*' | '/' | '%'), and if that's not there, it will look for ('+' | '-'), etc. In other words, the alternatives placed first in the rule will get precedence over alternatives placed lower in the rule.

Now I know from your earlier question, Once grammar is complete, what's the best way to walk an ANTLR v4 tree?, that you're using a listener to "walk" the tree. If you create an expression rule as I just showed, you'd need to do a lot of manual inspections in your enterExpression(...) and exitExpression(...) methods to find out which of the alternatives matched an expression. This is where "labels" come in handy. You simply label each alternative in your expression rule:

expression
 : '-' expression                                  #unaryExpr
 | '(' type_name ')' expression                    #castExpr
 | expression ('*' | '/' | '%') expression         #multExpr
 | expression ('+' | '-') expression               #addExpr
 | expression ('<<' | '>>') expression             #...
 | expression ('<' | '>' | '<=' | '>=') expression 
 | expression ('!=' | '==') expression
 | expression '&' expression
 | expression '^' expression
 | expression '|' expression
 | expression '&&' expression
 | expression '||' expression
 | expression '?' expression ':' expression
 | IDENTIFIER
 | NUMBER
 ;

(note that when you label one, you must label them all!)

And then the base listener class will have enter- and exit method for all alternatives:

public void enterUnaryExpr(...)
public void exitUnaryExpr(...)

public void enterCastExpr(...)
public void exitCastExpr(...)

public void enterMultExpr(...)
public void exitMultExpr(...)

...


回答2:

There is a very good reason for doing it this way: operator precedence. Taking your example of the logical OR and left bitwise shift, think about something like

if (a << b || c)

Objective-C precedence rules say that the '<<' has precedence, so the correct way to evaluate this is

(a << b) || c

The parser rules manage this by using the chain you mention, because the rule for '||' is higher up in the chain, the parse correctly gives a << b as a subexpression for the || operator.

There is no better way in Antl3, however in Antlr4, there is, as Antlr4 allows directly left recursive rules. I highly recommend the "Definitive Antlr4 reference" as it has a very good explanation of this issue.