I'm writing a parser for a language that looks like the following:
L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>
That is, each line may or may not start with a line number of the form Lxxx ('L' followed by one or more digits; the number of digits is not fixed), followed by an identifier or a keyword. Identifiers are the standard [a-zA-Z_][a-zA-Z0-9_]*. Spaces between the line number and the following identifier/keyword are optional (and not present in most cases).
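To make the intended split concrete, here is a rough Python sketch (not ANTLR) of how I would expect each line to be broken up. The regex, the function name `split_line`, and the `KEYWORDS` set are illustrative assumptions only; the keyword names are the placeholder ones from my grammar, and the input line is assumed to have its trailing newline already removed.

```python
import re

# Hypothetical keyword set, using the placeholder names from the grammar.
KEYWORDS = {"someKeyword", "reservedWord"}

# Optional line number ('L' + one or more digits), optional spaces or tabs,
# then a word of the form [a-zA-Z_][a-zA-Z0-9_]*.
LINE_RE = re.compile(r"(?:(L[0-9]+)[ \t]*)?([A-Za-z_][A-Za-z0-9_]*)")


def split_line(line):
    """Return (line_number_or_None, token_kind, text) for one input line."""
    m = LINE_RE.fullmatch(line)
    if m is None:
        raise ValueError(f"cannot split line: {line!r}")
    line_num, word = m.group(1), m.group(2)
    kind = "KEYWORD" if word in KEYWORDS else "IDENTIFIER"
    return line_num, kind, word
```

With this sketch, `split_line("L250foo")` gives `("L250", "IDENTIFIER", "foo")`, and `split_line("foo")` gives `(None, "IDENTIFIER", "foo")`.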
My current grammar looks like:
// Parser rules
commands : command*;
command : LINE_NUM? keyword NEWLINE
| LINE_NUM? IDENTIFIER NEWLINE;
keyword : KEYWORD_A | KEYWORD_B | ... ;
// Lexer rules
fragment INT : [0-9]+;
LINE_NUM : 'L' INT;
KEYWORD_A : 'someKeyword';
KEYWORD_B : 'reservedWord';
...
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]*;
However, this results in all lines beginning with a LINE_NUM token being tokenized as IDENTIFIERs.
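As far as I understand, this happens because the ANTLR lexer picks the rule that matches the longest run of input and only falls back to rule order on ties. For L250foo, LINE_NUM can match only L250, while IDENTIFIER matches all of L250foo, so IDENTIFIER wins. The Python sketch below imitates that selection with regexes; it is a simplified model under that assumption, not how ANTLR is actually implemented, and `RULES` and `next_token` are names made up for illustration.

```python
import re

# Lexer rules in grammar order (earlier rules win ties), mirroring my grammar.
RULES = [
    ("LINE_NUM", re.compile(r"L[0-9]+")),
    ("KEYWORD_A", re.compile(r"someKeyword")),
    ("KEYWORD_B", re.compile(r"reservedWord")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
]


def next_token(text, pos=0):
    """Pick the rule with the longest match at pos; the first rule wins ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Replace only on a strictly longer match, so an earlier rule keeps a tie.
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[0], best[1]
```

In this model, `next_token("L250foo")` returns `("IDENTIFIER", "L250foo")`, whereas `next_token("L250")` returns `("LINE_NUM", "L250")` because the two rules tie and LINE_NUM is listed first.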
Is there a way to properly tokenize this input using an ANTLR grammar?