ANTLR 4 lexer tokens inside other tokens

Posted 2019-03-11 15:12

Question:

I have the following grammar for ANTLR 4:

grammar Pattern;

//parser rules
parse   : string LBRACK CHAR DASH CHAR RBRACK ;
string  : (CHAR | DASH)+ ;

//lexer rules
DASH    : '-' ;
LBRACK  : '[' ;
RBRACK  : ']' ;
CHAR    : [A-Za-z0-9] ;

I'm trying to parse the following string:

ab-cd[0-9]

The grammar parses out the ab-cd on the left, which will be treated as a literal string in my application. It then parses out [0-9] as a character set, which in this case translates to any digit. My grammar works for me, except I don't like having (CHAR | DASH)+ as a parser rule when it's simply being treated as a token. I would rather the lexer create a STRING token and give me the following tokens:

"ab-cd" "[" "0" "-" "9" "]"

instead of these:

"ab" "-" "cd" "[" "0" "-" "9" "]"

I have looked at other examples, but haven't been able to figure it out. Usually other examples have quotes around such string literals or they have whitespace to help delimit the input. I'd like to avoid both. Can this be accomplished with lexer rules or do I need to continue to handle it in the parser rules like I'm doing?

Answer 1:

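In ANTLR 4, you can use lexer modes for this. Note that modes are only allowed in a lexer grammar, not in a combined grammar, so the rules below have to live in a separate lexer grammar rather than in the combined grammar from the question.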

STRING : [a-z-]+;
LBRACK : '[' -> pushMode(CharSet);

mode CharSet;

DASH : '-';
NUMBER : [0-9]+;
RBRACK : ']' -> popMode;

After matching a [ character, the lexer switches to mode CharSet and stays there until a ] character is matched and the popMode command is executed, which returns it to the default mode.
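
To show how the pieces might fit together, here is a minimal sketch of the grammar split into a lexer grammar and a parser grammar. The names PatternLexer and PatternParser are assumptions made for illustration. The STRING rule is widened and a CHAR rule is used inside the mode so that they accept the same characters as the CHAR rule in the question; that is an adaptation, not part of the answer above.

```antlr
// ---- file: PatternLexer.g4 (hypothetical name) ----
lexer grammar PatternLexer;

// Default mode: everything before '[' becomes one STRING token.
// Character set widened to match the question's CHAR rule (assumption).
STRING : [A-Za-z0-9-]+ ;
LBRACK : '[' -> pushMode(CharSet) ;

mode CharSet;

// Inside the brackets, characters and dashes are separate tokens.
DASH   : '-' ;
CHAR   : [A-Za-z0-9] ;
RBRACK : ']' -> popMode ;

// ---- file: PatternParser.g4 (hypothetical name) ----
parser grammar PatternParser;

// Pull the token types from the lexer grammar above.
options { tokenVocab = PatternLexer; }

parse : STRING LBRACK CHAR DASH CHAR RBRACK EOF ;
```

With this split, the input ab-cd[0-9] should be tokenized as STRING "ab-cd", LBRACK, CHAR "0", DASH, CHAR "9", RBRACK, which is the token sequence asked for in the question. Each grammar goes in its own .g4 file, and the parser rule no longer needs the (CHAR | DASH)+ workaround.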



Tags: antlr4