What is the ANTLR4 equivalent of a ! in a lexer rule?

Posted 2019-07-23 23:20

Question:

I'm working on converting an old ANTLR 2 grammar to ANTLR 4, and I'm having trouble with the string rule.

STRING :
    '\''!
    (
        ~('\'' | '\\' | '\r' | '\n')
    )*
    '\''!
    ;

This creates a STRING token whose text contains the contents of the string, but does not contain the starting and ending quotes, because of the ! symbol after the quote literals.

ANTLR 4 chokes on the ! symbol ('!' came as a complete surprise to me (AC0050)), but if I leave it off, I end up with tokens that contain the quotes, which is not what I want. What's the correct way to port this to ANTLR 4?

Answer 1:

ANTLR 4 generally treats tokens as immutable, at least in the sense that there is no support for a language-neutral equivalent of !.

Perhaps the simplest way to accomplish the equivalent is:

string : str=STRING { Strings.unquote($str); } ; 
STRING : SQuote ~[\r\n\\']* SQuote ;
fragment SQuote : '\'' ;

where Strings.unquote is:

public static void unquote(Token token) {
    CommonToken ct = (CommonToken) token;
    String text = ct.getText();
    text = .... unquote it ....
    ct.setText(text);
}
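
The unquoting step itself is left elided above. As a minimal sketch, assuming the token text always has the shape produced by the STRING rule (one single quote at each end, no backslashes in between), it could be a plain string helper. The class name StringUnquoter and the method stripQuotes are hypothetical names chosen for illustration; they are not part of ANTLR.

```java
// Hypothetical helper, not part of the ANTLR runtime.
final class StringUnquoter {
    private StringUnquoter() {}

    // Returns the text without its first and last character when both are
    // single quotes; any other input is returned unchanged.
    public static String stripQuotes(String text) {
        boolean quoted = text.length() >= 2
                && text.charAt(0) == '\''
                && text.charAt(text.length() - 1) == '\'';
        return quoted ? text.substring(1, text.length() - 1) : text;
    }
}
```

Inside Strings.unquote the elided line could then call StringUnquoter.stripQuotes(text). Because the STRING rule excludes backslashes, there are no escape sequences to decode, so trimming the two delimiters should be enough for this grammar.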

The reason for using a parser rule is that attribute references are not (currently) supported in the lexer. It could still be done on the lexer rule; it would just take a little more effort to get at the token.

An alternative to modifying the token text is to implement a custom token with custom fields and methods. See this answer if of interest.



Answer 2:

I believe in ANTLR4 your problem can be solved using lexical modes and lexer commands.

Here is an example from the documentation for those features that I think does what you need (it uses double quotes, but that is an easy change):

lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;

mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string


Tags: antlr antlr4