What are the benefits and drawbacks of using explicit token definitions in ANTLR4? I find the text in single parentheses more descriptive and easier to use than creating a separate token and using that in place of the text.
E.g.:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name ';' ;
library_name: IDENTIFIER;
module: MODULE module_name ';' ;
module_name: IDENTIFIER;
MODULE: 'module' ;
IDENTIFIER: [a-zA-Z0-9]+;
The generated tokens are:
T__0=1
T__1=2
MODULE=3
IDENTIFIER=4
'library'=1
';'=2
'module'=3
If I am not interested in the 'library'
"token", since the rule already establishes what I am matching against, and I will just skip over it anyway, does it make any sense to replace it with LIBRARY
and a token declaration? (The number of tokens then will grow.) Why is this a warning in ANTLRWorks?
The Antlr (and most compiler/compiler generators) implementation use the concept of a separate lexer and parser, mostly for performance reasons. In this model, the lexer is responsible for reading the actual characters in the input string and returning a list of the tokens found, in a more concise represetations, like an enum or int-codes for each token. The parser will work on these tokens instead of the original input for ease of implementation and performance.
There are two ways to "declare" the usage of a token in Antlr, one is explicit and have a regular pattern expresion, the other is implicit, is always a fixed string.
When declare a token explicitly, it's assigned an int-code to be used in the parsing state machine. Let's say
ExplicitRegExp
becomes 1 andExplicitFixed
becomes 2. But the parser will also need theimplicit
tokens to be able to parse the grammar correctly, so theimplicit
token is assigned the code 3 implicitly.How is that bad? You may have typos in different parts of the grammar:
And your grammar will not work as expected, because
implcit
will be a valid token, assigned the int-code 4. It also makes your grammar/lexer harder to debug due to Antlr auto-generating names for the implicit rules, likeT___0
. Another thing is that you lose the ordering of lexer rules, which could make a diference (usually not because implicit tokens are all fixed content).The Antlr compiler could choose to give you an error message and require you to write the tokens explicitly, but it chooses to let it go and just warn you that you should not to that, probably for prototyping/testing reasons.
To let Antlr be happy, do it the verbose way and declare all of your tokens:
Then it makes no difference if you reference a fixed token by it's explicit name (like
MODULE
) or by its content (like'module'
).Actually, there is a difference between implicit and explicit tokens:
From "The Definitive ANTLR4 Reference", page 76:
Highlight from me.