how to resolve an ambiguity

2019-08-16 16:15发布

问题:

I have a grammar:

grammar Test;

s      : ID OP (NUMBER | ID);

ID     : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;

OP     : '/.' | '/' ;
WS     : [ \t\r\n]+ -> skip ;

An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123). With the grammar above I get the first variant.

Is there a way to get both parse trees? Is there a way to control how it is parsed? Say, if there is a number after the /. then I emit the / otherwise I emit /. in the tree.

I am new to ANTLR.

回答1:

An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123)

I'm not sure. In the ReplaceAll page(*), Possible Issues paragraph, it is said that "Periods bind to numbers more strongly than to slash", so that /.123 will always be interpreted as a division operation by the number .123. Next it is said that to avoid this issue, a space must be inserted in the input between the /. operator and the number, if you want it to be understood as a replacement.

So there is only one possible parse tree (otherwise how could the Wolfram parser decide how to interpret the statement ?).

ANTLR4 lexer and parser are greedy. It means that the lexer (parser) tries to read as much input characters (tokens) that it can while matching a rule. With your OP rule OP : '/.' | '/' ; the lexer will always match the input /. to the /. alternative (even if the rule is OP : '/' | '/.' ;). This means there is no ambiguity and you have no chance the input to be interpreted as OP=/ and NUMBER=.123.

Given my small experience with ANTLR, I have found no other solution than to split the ReplaceAll operator into two tokens.

Grammar Question.g4 :

grammar Question;

/* Parse Wolfram ReplaceAll. */

question
@init {System.out.println("Question last update 0851");}
    :   s+ EOF
    ;

s   :   division
    |   replace_all
    ;

division
    :   expr '/' NUMBER
        {System.out.println("found division " + $expr.text + " by " + $NUMBER.text);}
    ;

replace_all
    :   expr '/' '.' replacement
        {System.out.println("found ReplaceAll " + $expr.text + " with " + $replacement.text);}
    ;

expr
    :   ID
    |   '"' ID '"'
    |   NUMBER
    |   '{' expr ( ',' expr )* '}'
    ;

replacement
    :   expr '->' expr    
    |   '{' replacement ( ',' replacement )* '}'
    ;

ID     : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
WS     : [ \t\r\n]+ -> skip ;

Input file t.text :

x/.123
x/.x -> 1
{x, y}/.{x -> 1, y -> 2}
{0, 1}/.0 -> "zero"
{0, 1}/. 0 -> "zero"

Execution :

$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4 
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text 
[@0,0:0='x',<ID>,1:0]
[@1,1:1='/',<'/'>,1:1]
[@2,2:5='.123',<NUMBER>,1:2]
[@3,7:7='x',<ID>,2:0]
[@4,8:8='/',<'/'>,2:1]
[@5,9:9='.',<'.'>,2:2]
[@6,10:10='x',<ID>,2:3]
[@7,12:13='->',<'->'>,2:5]
[@8,15:15='1',<NUMBER>,2:8]
[@9,17:17='{',<'{'>,3:0]
...
[@29,47:47='}',<'}'>,4:5]
[@30,48:48='/',<'/'>,4:6]
[@31,49:50='.0',<NUMBER>,4:7]
...
[@40,67:67='}',<'}'>,5:5]
[@41,68:68='/',<'/'>,5:6]
[@42,69:69='.',<'.'>,5:7]
[@43,71:71='0',<NUMBER>,5:9]
...
[@48,83:82='<EOF>',<EOF>,6:0]
Question last update 0851
found division x by .123
found ReplaceAll x with x->1
found ReplaceAll {x,y} with {x->1,y->2}
found division {0,1} by .0
line 4:10 extraneous input '->' expecting {<EOF>, '"', '{', ID, NUMBER}
found ReplaceAll {0,1} with 0->"zero"

The input x/.123 is ambiguous until the slash. Then the parser has two choices : / NUMBER in the division rule or / . expr in the replace_all rule. I think that NUMBER absorbs the input and so there is no more ambiguity.

(*) the link was yesterday in a comment that has disappeared, i.e. Wolfram Language & System, ReplaceAll



标签: antlr antlr4