Understanding ANTLR4 Tokens

Posted 2019-09-02 19:25

Question:

I'm pretty new to ANTLR and I'm trying to understand what exactly a Token is in ANTLR4. Consider the following pretty nonsensical grammar:

grammar Tst;

init: A token=('+'|'-') B;

A: .+?;
B: .+?;
ADD: '+';
SUB: '-';

ANTLR4 generates the following TstParser.InitContext for it:

public static class InitContext extends ParserRuleContext {
    public Token token;       //<---------------------------- HERE
    public TerminalNode A() { return getToken(TstParser.A, 0); }
    public TerminalNode B() { return getToken(TstParser.B, 0); }
    public InitContext(ParserRuleContext parent, int invokingState) {
        super(parent, invokingState);
    }
    @Override public int getRuleIndex() { return RULE_init; }
    @Override
    public void enterRule(ParseTreeListener listener) {
        if ( listener instanceof TstListener ) ((TstListener)listener).enterInit(this);
    }
    @Override
    public void exitRule(ParseTreeListener listener) {
        if ( listener instanceof TstListener ) ((TstListener)listener).exitInit(this);
    }
}

Now, all lexer rules are available as static constants in the parser class:

public static final int A=1, B=2, ADD=3, SUB=4;

How can we use them to identify lexer rules? The A, B, and ADD rules may all match '+'. So which type should I use when testing it?

I mean this:

TstParser.InitContext ctx;
//...
ctx.token.getType() == //What type?
                       //TstParser.A
                       //TstParser.B
                       //or
                       //TstParser.ADD?

In general, I would like to learn how ANTLR4 determines the type of a Token.

Answer 1:

I will try to introduce you to the process of parsing. The process has two stages: the lexer stage, where tokens are created, and the parser stage. (The name "parsing" comes from the second stage, so it is not very precise when used for the process as a whole.) All you are trying to do is understand the input and, along the way, perhaps build a model of it. To make this easier, the job is generally divided into smaller steps. It is much easier to work with tokens, which are elements of the input somewhat bigger than characters and correspond roughly to "words" (keywords, variables, and literals, to be precise).

Because of this, the first step is to pre-process the input character stream into tokens. All you can say about a token is what value is associated with it and what kind of token it is. For instance, in the very simple calculator input "2+3*9", '2' is a number token with the value 2, '+' is an operator token with the value '+', and so on. The result of the lexer stage is a stream of tokens. As you can imagine, lexer rules and parser rules are very similar: the first step works with characters, the second with tokens.
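As a rough sketch of that first step, here is a small hand-written Java tokenizer for the calculator input. The names CalcLexerSketch, Tok, NUMBER, and OPERATOR are invented for illustration; this is not code that ANTLR generates, only an approximation of what a lexer hands to the parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not ANTLR-generated code.
class CalcLexerSketch {
    // A token pairs a kind (its "type") with the text it covers.
    static final class Tok {
        final String kind;
        final String text;
        Tok(String kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    // Turns a character sequence into a list of tokens.
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isDigit(c)) {
                // Group consecutive digits into a single NUMBER token.
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                out.add(new Tok("NUMBER", input.substring(start, i)));
            } else if ("+-*/".indexOf(c) >= 0) {
                // Each operator character becomes its own token.
                out.add(new Tok("OPERATOR", String.valueOf(c)));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("2+3*9"));
    }
}
```

For "2+3*9" this yields five tokens: NUMBER(2), OPERATOR(+), NUMBER(3), OPERATOR(*), NUMBER(9). The parser stage would then work on this list and never look at individual characters again.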

Regarding ANTLR (many other generators work the same way), there is one important rule for the lexer: you cannot have the same rule for different tokens. So the grammar you posted won't work, because the lexer cannot distinguish between A and B. You can simply use the same token name on both sides and deal with the difference later.

Why can't lexer rules be the same? As the lexer processes the input, it walks the character stream. At each position it applies the rule that matches the longest run of characters, and when several rules match the same length, the one defined first in the grammar wins. So if another rule would have matched as well, what a pity: it never gets a chance. The parser in ANTLR is much more forgiving than the lexer.
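To make the consequence for the grammar in the question concrete, here is a small Java simulation of how a lexer could pick a rule: the longest match wins, and on a tie the rule listed first is kept. It is only a sketch, not ANTLR's implementation. RuleChoiceSketch and chooseRule are made-up names, and A and B are approximated as "any single character", which is an assumption about how the non-greedy `.+?` behaves at the end of a lexer rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of lexer rule selection; not ANTLR's implementation.
class RuleChoiceSketch {
    // Rules in grammar order (LinkedHashMap preserves insertion order).
    // A and B are approximated as "any one character" (assumption).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("A", Pattern.compile("."));
        RULES.put("B", Pattern.compile("."));
        RULES.put("ADD", Pattern.compile("\\+"));
        RULES.put("SUB", Pattern.compile("-"));
    }

    // Returns the name of the rule chosen at the start of the input,
    // or null if no rule matches.
    static String chooseRule(String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strict '>' means a later rule must match MORE characters to win,
            // so the earlier rule is kept when lengths are equal.
            if (m.lookingAt() && m.end() > bestLen) {
                best = rule.getKey();
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(chooseRule("+"));
        System.out.println(chooseRule("-"));
    }
}
```

Under these assumptions both calls print A: rule A comes first and matches the same single character as B, ADD, and SUB, so the later rules never produce a token. That is why testing ctx.token.getType() against ADD would not succeed with the grammar as written.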

To sum up: tokens are the products of the lexer. They are groups of one or more characters that should be presented to the next step as a single unit. We are talking about variable names, operators, function names, etc.