How to have both function calls and parenthetical

2019-07-22 01:00发布

问题:

Is there any way to specify a grammar which allows the following syntax:

f(x)(g, (1-(-2))*3, 1+2*3)[0]

which is transformed into (in pseudo-lisp to show order):

(index 
  ((f x)
    g 
    (* (- 1 -2) 3)
    (+ (* 2 3) 1)
  ) 
  0
)

along with things like limited operator precedence etc.


The following grammar works with backtrack = true, but I'd like to avoid that:

grammar T;

options {
 output=AST;
 backtrack=true;
 memoize=true;
}

tokens {
  CALL;
  INDEX;
  LOOKUP;
}

prog: (expr '\n')* ;

expr : boolExpr;

boolExpr
 : relExpr (boolop^ relExpr)?
 ;

relExpr
 : addExpr (relop^ addExpr)?
 | a=addExpr oa=relop b=addExpr ob=relop c=addExpr
   -> ^(LAND ^($oa $a $b) ^($ob $b $c))
 ;


addExpr
 : mulExpr (addop^ mulExpr)?
 ;

mulExpr
 : atomExpr (mulop^ atomExpr)?
 ;

atomExpr
 : INT
 | ID
 | OPAREN expr CPAREN -> expr
 | call
 ;

call
 : callable ( OPAREN (expr (COMMA expr)*)? CPAREN -> ^(CALL callable expr*)
            | OBRACK expr CBRACK -> ^(INDEX callable expr)
            | DOT ID -> ^(INDEX callable ID)
            )
 ;

fragment
callable
 : ID
 | OPAREN expr CPAREN
 ;

fragment
boolop
 : LAND | LOR
 ;

fragment
relop
 : (EQ|GT|LT|GTE|LTE)
 ;

fragment
addop
 : (PLUS|MINUS)
 ;

fragment
mulop
 : (TIMES|DIVIDE)
 ;

EQ : '==' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;

LAND : '&&' ;
LOR : '||' ;

PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIVIDE : '/' ;

ID : ('a'..'z')+ ;
INT : '0'..'9' ;

OPAREN : '(' ;
CPAREN : ')' ;
OBRACK : '[' ;
CBRACK : ']' ;
DOT : '.' ;

COMMA : ',' ;

回答1:

There are a couple of things wrong with your grammar:

1

Only lexer rules can be fragments, not parser rules. Some ANTLR targets simply ignore the fragment keyword in front of parser rules (like the Java target), but better just remove them from your grammar: if you decide to create a parser for a different target-language, you may run into problems because of it.

2

Without the backtrack=true, you cannot mix tree-rewrite operators (^ and !) and rewrite rules (->) because you need to create a single alternative inside relExpr instead of the two alternatives you now have (this is to eliminate an ambiguity).

In your case, you can't create the desired AST with just ^ (inside a single alternative), so you'll need to do it like this:

relExpr
 : (a=addExpr -> $a) ( (oa=relOp b=addExpr    -> ^($oa $a $b))
                         ( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c))
                         )?
                     )?
 ;

(yes, I know, it's not particularly pretty, but that can't be helped AFAIK)

Also, you can only put the LAND token in the rewrite rules if it is defined in the tokens { ... } block:

tokens {
  // literal tokens
  LAND='&&';
  ...

  // imaginary tokens
  CALL;
  ...
}

Otherwise you can only use tokens (and other parser rules) in rewrite rules if they really occur inside the parser rule itself.

3

You did not account for the unary minus in your grammar, implement it like this:

mulExpr
 : unaryExpr ((TIMES | DIVIDE)^ unaryExpr)*
 ;

unaryExpr
 : MINUS atomExpr -> ^(UNARY_MINUS atomExpr)
 | atomExpr
 ;

Now, to create a grammar that does not need backtrack=true, remove the ID and '(' expr ')' from your atomExpr rule:

atomExpr
 : INT
 | call
 ;

and make everything passed callable optional inside your call rule:

call
 : (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
                          | OBRACK expr CBRACK   -> ^(INDEX $call expr)
                          | DOT ID               -> ^(INDEX $call ID)
                          )*
 ;

That way, ID and '(' expr ')' are already matched by call (and there's no ambiguity).


Taken all the remarks above into account, you could get the following grammar:

grammar T;

options {
  output=AST;
}

tokens {
 // literal tokens
 EQ     = '==' ;
 GT     = '>' ;
 LT     = '<' ;
 GTE    = '>=' ;
 LTE    = '<=' ;
 LAND   = '&&' ;
 LOR    = '||' ;
 PLUS   = '+' ;
 MINUS  = '-' ;
 TIMES  = '*' ;
 DIVIDE = '/' ;
 OPAREN = '(' ;
 CPAREN = ')' ;
 OBRACK = '[' ;
 CBRACK = ']' ;
 DOT    = '.' ;
 COMMA  = ',' ;

 // imaginary tokens
 CALL;
 INDEX;
 LOOKUP;
 UNARY_MINUS;
 PARAMS;
}

prog
 : expr EOF -> expr
 ;

expr
 : boolExpr
 ;

boolExpr
 : relExpr ((LAND | LOR)^ relExpr)?
 ;

relExpr
 : (a=addExpr -> $a) ( (oa=relOp b=addExpr    -> ^($oa $a $b))
                         ( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c))
                         )?
                     )?
 ;

addExpr
 : mulExpr ((PLUS | MINUS)^ mulExpr)*
 ;

mulExpr
 : unaryExpr ((TIMES | DIVIDE)^ unaryExpr)*
 ;

unaryExpr
 : MINUS atomExpr -> ^(UNARY_MINUS atomExpr)
 | atomExpr
 ;

atomExpr
 : INT
 | call
 ;

call
 : (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
                          | OBRACK expr CBRACK   -> ^(INDEX $call expr)
                          | DOT ID               -> ^(INDEX $call ID)
                          )*
 ;

callable
 : ID
 | OPAREN expr CPAREN -> expr
 ;

params
 : (expr (COMMA expr)*)? -> ^(PARAMS expr*)
 ;

relOp
 : EQ | GT | LT | GTE | LTE
 ;

ID     : 'a'..'z'+ ;
INT    : '0'..'9'+ ;
SPACE  : (' ' | '\t') {skip();};

which would parse the input "a >= b < c" into the following AST:

and the input "f(x)(g, (1-(-2))*3, 1+2*3)[0]" as follows:



标签: antlr antlr3