ANTLR4: implicit or explicit token definition

2019-04-14 10:50发布

问题:

What are the benefits and drawbacks of using explicit token definitions in ANTLR4? I find the text in single parentheses more descriptive and easier to use than creating a separate token and using that in place of the text.

E.g.:

grammar SimpleTest;

top: library | module ;

library: 'library' library_name ';' ;
library_name: IDENTIFIER;         

module: MODULE module_name ';' ;
module_name: IDENTIFIER;

MODULE: 'module' ;
IDENTIFIER: [a-zA-Z0-9]+;

The generated tokens are:

T__0=1
T__1=2
MODULE=3
IDENTIFIER=4
'library'=1
';'=2
'module'=3

If I am not interested in the 'library' "token", since the rule already establishes what I am matching against, and I will just skip over it anyway, does it make any sense to replace it with LIBRARY and a token declaration? (The number of tokens then will grow.) Why is this a warning in ANTLRWorks?

回答1:

The Antlr (and most compiler/compiler generators) implementation use the concept of a separate lexer and parser, mostly for performance reasons. In this model, the lexer is responsible for reading the actual characters in the input string and returning a list of the tokens found, in a more concise represetations, like an enum or int-codes for each token. The parser will work on these tokens instead of the original input for ease of implementation and performance.

There are two ways to "declare" the usage of a token in Antlr, one is explicit and have a regular pattern expresion, the other is implicit, is always a fixed string.

ExplicitRegExp: [A-Z][a-z]+; // lexer rule starts with uppercase letter
ExplicitFixed: 'fixed';
parserRule: 'implicit' ExplicitRegExp; // parser rules starts with lowercase letter

When declare a token explicitly, it's assigned an int-code to be used in the parsing state machine. Let's say ExplicitRegExp becomes 1 and ExplicitFixed becomes 2. But the parser will also need the implicit tokens to be able to parse the grammar correctly, so the implicit token is assigned the code 3 implicitly.

How is that bad? You may have typos in different parts of the grammar:

a : 'implicit' c;
b : 'implcit' d; // typo here

And your grammar will not work as expected, because implcit will be a valid token, assigned the int-code 4. It also makes your grammar/lexer harder to debug due to Antlr auto-generating names for the implicit rules, like T___0. Another thing is that you lose the ordering of lexer rules, which could make a diference (usually not because implicit tokens are all fixed content).

The Antlr compiler could choose to give you an error message and require you to write the tokens explicitly, but it chooses to let it go and just warn you that you should not to that, probably for prototyping/testing reasons.

To let Antlr be happy, do it the verbose way and declare all of your tokens:

grammar SimpleTest;

top: library | module ;

library: 'library' library_name=IDENTIFIER ';' ; // I'm using aliasing instead of different parser rule here, just a preference

module: 'module' module_name=IDENTIFIER ';' ;

MODULE: 'module' ;
LIBRARY: 'library' ;
IDENTIFIER: [a-zA-Z0-9]+;

Then it makes no difference if you reference a fixed token by it's explicit name (like MODULE) or by its content (like 'module').



回答2:

Actually, there is a difference between implicit and explicit tokens:

From "The Definitive ANTLR4 Reference", page 76:

ANTLR collects and separates all of the string literals and lexer rules from the parser rules. Literals such as 'enum' become lexical rules and go immediately after the parser rules but before the explicit lexical rules.

ANTLR lexers resolve ambiguities between lexical rules by favoring the rule specified first.

Highlight from me.