ANTLR4 lexer not resolving ambiguity in grammar order

Posted 2019-07-13 14:23

Question:

Using ANTLR 4.2, I'm trying a very simple parse of this test data:

RRV0#ABC

Using a minimal grammar:

grammar Tiny;

thing : RRV N HASH ID ;

RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [ \t\r\n]+ -> skip ; // match 1-or-more whitespace but discard

I expect the lexer rule RRV to take precedence over ID, based on the excerpt below from Terence Parr's The Definitive ANTLR 4 Reference:

BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter

Running the ANTLR4 test rig with the test data above, the output is

[@0,0:3='RRV0',<4>,1:0]
[@1,4:4='#',<3>,1:4]
[@2,5:7='ABC',<4>,1:5]
[@3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'

I can see that the first token is <4>, which is ID, with the value 'RRV0'.

I have tried rearranging the order of the lexer items. I have also tried implicit lexer items, matching the literal directly in the grammar rule rather than through an explicit lexer item. I tried making the matches non-greedy too. None of these worked for me.

If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.

I started in ANTLR 4.1 with the same issue.

I checked in ANTLRWorks and from the command line, with the same result both ways.

How can I change the grammar to match lexer item RRV in preference to ID?

Answer 1:

The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
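The rule described above can be sketched as a small simulation. This is not ANTLR's implementation, only a rough Python model of "longest match wins, grammar order breaks ties", with the rules of the Tiny grammar approximated by regular expressions. The names RULES and tokenize are made up for this illustration.

```python
import re

# Lexer rules in grammar order, approximating the Tiny grammar from the question.
RULES = [
    ("RRV", re.compile(r"RRV")),
    ("N", re.compile(r"[0-9]+")),
    ("HASH", re.compile(r"#")),
    ("ID", re.compile(r"[a-zA-Z0-9]+")),
]


def tokenize(text):
    """Longest match wins; grammar order only breaks ties of equal length."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("RRV0#ABC"))  # [('ID', 'RRV0'), ('HASH', '#'), ('ID', 'ABC')]
print(tokenize("RRV#ABC"))   # [('RRV', 'RRV'), ('HASH', '#'), ('ID', 'ABC')]
```

For RRV0#ABC the model reproduces the token stream from the question: ID consumes four characters, so RRV never gets a chance. Only when both rules match exactly three characters, as in RRV#ABC, does the earlier rule RRV win.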

This strategy is especially important in languages like Java. Consider the following input:

String className = "";

Along with the following two grammar rules (slightly simplified):

CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;

If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.
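To make the comparison concrete, here is a rough Python sketch that contrasts the two policies on the CLASS and ID rules. It is an illustration only, not ANTLR code: the "first rule wins" policy is hypothetical, and the names RULES, next_token and tokenize are invented for this example.

```python
import re

# The two simplified rules from the answer, in grammar order.
RULES = [
    ("CLASS", re.compile(r"class")),
    ("ID", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
]


def next_token(text, pos, rules, longest_match):
    """Return (rule name, lexeme) for the token starting at pos, or None."""
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if not m:
            continue
        if not longest_match:
            # Hypothetical policy: the first rule that matches wins outright.
            return name, m.group()
        # Longest match wins; strict '>' lets the earlier rule win ties.
        if best is None or len(m.group()) > len(best[1]):
            best = (name, m.group())
    return best


def tokenize(text, rules=RULES, longest_match=True):
    tokens, pos = [], 0
    while pos < len(text):
        token = next_token(text, pos, rules, longest_match)
        if token is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(token)
        pos += len(token[1])
    return tokens


print(tokenize("className"))                       # [('ID', 'className')]
print(tokenize("class"))                           # [('CLASS', 'class')]
print(tokenize("className", longest_match=False))  # [('CLASS', 'class'), ('ID', 'Name')]
print(tokenize("class", rules=RULES[::-1], longest_match=False))  # [('ID', 'class')]
```

The last two calls show the problem with an order-only policy: with CLASS first, className is split into a keyword and Name, and with ID first, the input class can never become a CLASS token.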



Tags: antlr4