Ordering lexer rules in a grammar using ANTLR4

Posted 2019-08-16 17:53

Question:

I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.

I want the parser to be able to handle something like this:

Hello << name >>, how are you?

At runtime I will replace "<< name >>" with the user's name.

So mostly I am parsing text words (and punctuation, symbols, etc), except with the occasional "<< something >>" tag, which I am calling a "func" in my lexer rules.

Here is my grammar:

doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;

WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;

Side note: I added "PUNCT?" at the end of the "item" rule because a comma can appear right after a "func", as in the example sentence above. Since a comma can also follow a "WORD", I put the punctuation in "item" instead of in both "func" and "WORD".

If I run this parser on the above sentence, I get a parse tree that looks like this:

Anything highlighted in red is a parse error.

So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.

If I swap the order of "ID" and "WORD" in my grammar, so now they are in this order:

ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;

And run the parser, I get a parse tree like this:

So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.

How do I get past this conundrum?

I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).

Thanks for any help!

Answer 1:

From The Definitive ANTLR 4 Reference:

ANTLR resolves lexical ambiguities by matching the input string to the rule specified first in the grammar.

With your grammar (in Question.g4) and a t.text file containing

Hello << name >>, how are you at nine o'clock?

the execution of

$ grun Question doc -tokens -diagnostics t.text

gives

[@0,0:4='Hello',<WORD>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<WORD>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<WORD>,1:18]
[@6,22:24='are',<WORD>,1:22]
[@7,26:28='you',<WORD>,1:26]
[@8,30:31='at',<WORD>,1:30]
[@9,33:36='nine',<WORD>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}

Now change WORD to word in the item rule, and add a word rule:

item: (func | word) PUNCT? ;
word: WORD | ID ;

and put ID before WORD:

ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;

The tokens are now

[@0,0:4='Hello',<ID>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<ID>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<ID>,1:18]
[@6,22:24='are',<ID>,1:22]
[@7,26:28='you',<ID>,1:26]
[@8,30:31='at',<ID>,1:30]
[@9,33:36='nine',<ID>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]

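and the errors are gone. With the -gui option, the parse tree now shows branches identified as word or func. Note that o'clock is still a WORD even though ID comes first: ID can match only its first letter, and the lexer prefers the longer match.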
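To see why both changes are needed, the parser side can be imitated with a small hand-written recognizer. The Python sketch below is not ANTLR-generated code: `accepts`, `ORIGINAL_TOKENS`, and `REVISED_TOKENS` are illustrative names, and the two lists are the token types copied from the -tokens dumps shown in this answer.

```python
def accepts(types, word_types):
    """Recognize  doc: item* EOF ;  item: (func | word) PUNCT? ;
    func: '<<' ID '>>' ;  over a list of token-type names.
    word_types is the set of token types allowed where a word is expected."""
    pos = 0

    def peek():
        # Past the end of the list we behave as if EOF had been read.
        return types[pos] if pos < len(types) else "EOF"

    while peek() != "EOF":
        if peek() == "<<":
            # func needs exactly '<<' ID '>>'; a WORD in the middle is an error.
            if types[pos:pos + 3] != ["<<", "ID", ">>"]:
                return False
            pos += 3
        elif peek() in word_types:
            pos += 1
        else:
            return False
        if peek() == "PUNCT":  # optional trailing punctuation
            pos += 1
    return True


# Token types from the first dump (WORD declared before ID).
ORIGINAL_TOKENS = ["WORD", "<<", "WORD", ">>", "PUNCT",
                   "WORD", "WORD", "WORD", "WORD", "WORD", "WORD", "PUNCT"]
# Token types from the second dump (ID declared before WORD).
REVISED_TOKENS = ["ID", "<<", "ID", ">>", "PUNCT",
                  "ID", "ID", "ID", "ID", "ID", "WORD", "PUNCT"]

print(accepts(ORIGINAL_TOKENS, {"WORD"}))        # original grammar
print(accepts(REVISED_TOKENS, {"WORD"}))         # rules swapped only
print(accepts(REVISED_TOKENS, {"WORD", "ID"}))   # swapped plus the word rule
```

Only the last call succeeds: swapping the lexer rules alone turns plain words into ID tokens that item does not accept, which is why the word rule has to allow both token types.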



Answer 2:

As "500 - Internal Server Error" already mentioned in his comment, ANTLR gives priority to lexer rules in the order they are defined in the grammar: when several rules match the same longest stretch of input, the topmost one wins. The lexer runs independently of the parser, so once some input has been tokenized, ANTLR won't try to match it differently.

In your case the WORD and ID rules can both match input like abc, but because WORD is declared first, abc will always be matched as a WORD and never as an ID. In fact ID will never be matched at all, because every valid ID can also be matched by WORD.
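The selection policy can be imitated in a few lines of Python. This is only a sketch under assumptions: the regular expressions approximate the lexer rules from the question, `tokenize` is a hypothetical helper, and real ANTLR lexers are built differently. At each position it keeps the longest match and falls back to rule order only on a tie.

```python
import re

# Regex approximations of the question's lexer rules (an assumption: ANTLR
# does not use Python regexes; only the selection policy is imitated here).
SYMB = r"[^a-zA-Z0-9.,?! |{}<>]"
WORD = ("WORD", r"(?:[a-zA-Z0-9]|" + SYMB + r")+")
ID = ("ID", r"[a-zA-Z][a-zA-Z0-9]*")
# '<<' and '>>' come from the parser rule "func". They are listed first
# here; no other rule in this grammar can match '<' or '>' anyway.
HEAD = [("'<<'", r"<<"), ("'>>'", r">>"), ("WS", r"[ \t\n\r]")]
TAIL = [("PUNCT", r"[.,?!]")]

WORD_FIRST = HEAD + [WORD, ID] + TAIL   # order used in the question
ID_FIRST = HEAD + [ID, WORD] + TAIL     # order after swapping the two rules

SENTENCE = "Hello << name >>, how are you at nine o'clock?"


def tokenize(text, rules):
    """Return (rule name, text) pairs; rules are given in grammar order."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed earlier is kept.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mimic "-> skip"
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for label, rules in (("WORD first", WORD_FIRST), ("ID first", ID_FIRST)):
        print(label, tokenize(SENTENCE, rules))
```

With WORD first, no ID token is ever produced. With ID first, plain words such as name become ID, while o'clock stays a WORD because WORD matches more characters there.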

That said, if your only goal is to replace whatever is between << and >>, you'd be better off using regular expressions. If you still want to use ANTLR for it, reduce your grammar to the essentials: distinguishing input between << and >> from everything else. Your grammar should then look something like this:

start: (INTERESTING | UNINTERESTING)* EOF ;
INTERESTING: '<<' .*? '>>' ;
UNINTERESTING: (~[<])+ | '<' ;

Or you could skip UNINTERESTING completely with a -> skip lexer command, so that the parser only sees the INTERESTING tokens.
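For the regular-expression route mentioned above, a minimal Python sketch might look like the following. The names `TAG` and `fill` are made up for illustration, and the pattern assumes a tag holds a single identifier (a letter followed by letters or digits, as in the question's ID rule) with optional whitespace around it.

```python
import re

# "<<", optional whitespace, an identifier, optional whitespace, ">>".
TAG = re.compile(r"<<\s*([A-Za-z][A-Za-z0-9]*)\s*>>")


def fill(template, values):
    """Replace each << name >> tag with values[name].
    Tags whose name is not in values are left untouched."""
    def replace(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return TAG.sub(replace, template)


print(fill("Hello << name >>, how are you?", {"name": "Ada"}))
# prints: Hello Ada, how are you?
```

Leaving unknown tags in place is a design choice made for this sketch; raising an error instead would be just as reasonable.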