ANTLR parser for alpha numeric words which may hav

Posted 2019-08-17 01:45

Question:

First I tried to recognise a plain word, and the grammar below works fine:

grammar Test;

myToken: WORD;
WORD: (LOWERCASE | UPPERCASE )+ ;
fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment DIGIT: '0'..'9' ;
WHITESPACE  : (' ' | '\t')+;

As soon as I added the parser rule below just beneath "myToken", even my WORD tokens were no longer recognised for the input string "abc":

ALPHA_NUMERIC_WS: ( WORD | DIGIT | WHITESPACE)+;

Does anyone have any idea why that is?

Answer 1:

This is because ANTLR's lexer prefers the rule that matches the most input and, when several rules match the same amount, falls back to "first come, first served": the rule specified first in the source wins, and the others are not used for that input.

In your case ALPHA_NUMERIC_WS matches everything WORD matches (and more), and it is specified before WORD. WORD will therefore never be used, because there is no input that WORD can match that ALPHA_NUMERIC_WS cannot match at least as well. (The same applies to the WHITESPACE rule; DIGIT is a fragment and never produces a token on its own.)
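
To make the rule selection concrete, here is an excerpt of the question's lexer rules, annotated with what should happen for two inputs. It is only a sketch and assumes ANTLR 4's usual behaviour (longest match first, then definition order); the fragments are the ones from the question.

```antlr
// Input "abc": both rules can match all 3 characters, so the lengths tie.
// The rule defined first wins, and the parser receives an
// ALPHA_NUMERIC_WS token where myToken expects a WORD.
ALPHA_NUMERIC_WS : ( WORD | DIGIT | WHITESPACE )+ ;
WORD             : ( LOWERCASE | UPPERCASE )+ ;

// Input "abc def": ALPHA_NUMERIC_WS can match all 7 characters, WORD only
// the first 3. The longer match wins, so ALPHA_NUMERIC_WS would be chosen
// here even if it were defined after WORD.
```

So moving ALPHA_NUMERIC_WS below WORD would only fix single-word inputs such as "abc".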

I guess that you do not actually want an ALPHA_NUMERIC_WS token (which is what you get by writing it as a lexer rule, with an uppercase name) but a parser rule, so that it can be referenced from other parser rules to allow an arbitrary sequence of WORDs, DIGITs and WHITESPACEs.

Therefore you'd want to write it like this:

alpha_numeric_ws: ( WORD | DIGIT | WHITESPACE)+;
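
For reference, a minimal sketch of the whole grammar with that change applied, assuming ANTLR 4. A fragment such as DIGIT never becomes a token by itself, so the sketch wraps it in a NUMBER token. NUMBER is a hypothetical name introduced here for illustration and is not part of the original grammar.

```antlr
grammar Test;

// Parser rules (lowercase names) are built from tokens.
myToken          : WORD ;
alpha_numeric_ws : ( WORD | NUMBER | WHITESPACE )+ ;

// Lexer rules (uppercase names) produce the tokens.
WORD       : ( LOWERCASE | UPPERCASE )+ ;
NUMBER     : DIGIT+ ;              // assumption: token wrapper around DIGIT
WHITESPACE : ( ' ' | '\t' )+ ;

// Fragments are helpers for lexer rules only; they never reach the parser.
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT     : '0'..'9' ;
```

With this layout the lexer rules no longer overlap, so "abc" should be tokenised as a single WORD, and alpha_numeric_ws combines the tokens at the parser level.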

If you really do want a single ALPHA_NUMERIC_WS token, you can either remove the lexer rules that follow it, or rethink what the lexer's job is and where to draw the line between lexer and parser. Either way, the grammar has to be redesigned for this to work.