Token recognition error: antlr

Posted 2019-06-24 06:50

Question:

I have an ANTLR 4 grammar:

grammar Test;

start : NonZeroDigit '.' Digit Digit? EOF
      ;

DOT            :    '.'  ;
PLUS           :    '+'  ;
MINUS          :    '-'  ;
COLON          :    ':'  ;
COMMA          :    ','  ;
QUOTE          :    '\"' ;
EQUALS         :    '='  ;
SEMICOLON      :    ';'  ;
UNDERLINE      :    '_'  ;
BACKSLASH      :    '\\' ;
SINGLEQUOTE    :    '\'' ;

RESULT_TYPE_NONE          :    'NONE'       ;
RESULT_TYPE_RESULT        :    'RESULT'     ;
RESULT_TYPE_RESULT_SET    :    'RESULT_SET' ;

TYPE_INT       :    'Int'    ;
TYPE_LONG      :    'Long'   ;
TYPE_BOOL      :    'Bool'   ;
TYPE_DATE      :    'Date'   ;
TYPE_DOUBLE    :    'Double' ;
TYPE_STRING    :    'String' ;

TYPE_INT_LIST       :    'List<Int>'   ;
TYPE_LONG_LIST      :    'List<Long>'   ;
TYPE_BOOL_LIST      :    'List<Bool>'   ;
TYPE_DATE_LIST      :    'List<Date>'   ;
TYPE_DOUBLE_LIST    :    'List<Double>' ;
TYPE_STRING_LIST    :    'List<String>' ;

LONG_END      :    'L' ;
DOUBLE_END    :    'd' ;

DATE_NOW      :    'NOW'   ;
BOOL_TRUE     :    'true'  ;
BOOL_FALSE    :    'false' ;

BLOCK_OPEN       :    '{' ;
BLOCK_CLOSE      :    '}' ;
GENERIC_OPEN     :    '<' ;
GENERIC_CLOSE    :    '>' ;
BRACKET_OPEN     :    '(' ;
BRACKET_CLOSE    :    ')' ;

MAP      :    'Map'   ;
LIST     :    'List'  ;
GROUP    :    'Group' ;

BY             :    'by'         ;
DEFAULT        :    'default'    ;
JSON_NAME      :    'JSONName'   ;
INTERFACE      :    'interface'  ;
CLASS          :    'class'      ;
ABSTRACT       :    'abstract'   ;
IMPLEMENTS     :    'implements' ;
EXTENDS        :    'extends'    ;
CACHEABLE      :    'cacheable'  ;
FUNCTION       :    'function'   ;
REQUEST        :    'request'    ;
NAMED_QUERY    :    'namedQuery' ;
INPUT          :    'input'      ;
OUTPUT         :    'output'     ;
RESULT_TYPE    :    'resultType' ;
PACKAGE        :    'package'    ;
SCHEMA         :    'schema'     ;
VERSION        :    'version'    ;
MIN_VERSION    :    'minVersion' ;

fragment
NonZeroDigit : [1-9]
             ;

fragment
Digit : '0' | NonZeroDigit
      ;

fragment
Digits : Digit+
       ;

fragment
IntegerNumber : '0' | ( NonZeroDigit Digits? )
              ;

fragment
SignedIntegerNumber : ( '+' | '-' )? IntegerNumber
                    ;

fragment
FloatingNumber : IntegerNumber ( '.' Digits )?
               ;

fragment
SignedFloatingNumber : ( '+' | '-' )? FloatingNumber
                     ;

fragment
Letter : [a-z]
       ;

fragment
Letters : Letter+
        ;

fragment
CapitalLetter : [A-Z]
              ;

fragment
CapitalLetters : CapitalLetter+
               ;

fragment
LetterOrDigitOrUnderline : Letter | CapitalLetter | Digit | '_'
                         ;

fragment
EscapeSequence :   ( '\\' ( 'b' | 't' | 'n' | 'f' | 'r' | '\"' | '\'' | '\\' ) ) 
               |   UnicodeEscape
               |   OctalEscape
               ;

fragment
HexDigit : [0-9] | [a-f] | [A-F]
         ;

fragment
UnicodeEscape : '\\' 'u' HexDigit HexDigit HexDigit HexDigit
              ;

fragment
OctalEscape :   ( '\\' [0-3] [0-7] [0-7] )
            |   ( '\\' [0-7] [0-7] )
            |   ( '\\' [0-7] )
            ;

WS : [ \t\r\n]+ -> skip
   ;

I'm using it like this:

final ByteArrayInputStream input = new ByteArrayInputStream("1.11".getBytes());
final TestLexer lexer = new TestLexer(new ANTLRInputStream(input));
final TestParser parser = new TestParser(new CommonTokenStream(lexer));
parser.start();

But this gives me:

line 1:0 token recognition error at: '1'
line 1:2 token recognition error at: '1'
line 1:3 token recognition error at: '1'
line 1:1 missing NonZeroDigit at '.'
line 1:4 missing Digit at '<EOF>'

What am I doing wrong? I'm using antlr v4.1.

Thanks in advance for helping.

Answer 1:

Fragment lexer rules can only be used by other lexer rules; they never become tokens on their own. Therefore, you cannot use fragment rules in parser rules.
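
To make this concrete, here is a toy model in plain Java. It is only a sketch, not ANTLR's actual implementation: the names FragmentDemo, Rule, questionRules and tokenize are made up for illustration, and each rule body is approximated by a regex. The lexer loop skips every rule flagged as a fragment, which is enough to reproduce the three recognition errors for the input 1.11.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model, NOT ANTLR's real implementation: each rule body is approximated by a regex.
class FragmentDemo {
    static final class Rule {
        final String name;
        final Pattern body;
        final boolean fragment;

        Rule(String name, String regex, boolean fragment) {
            this.name = name;
            this.body = Pattern.compile(regex);
            this.fragment = fragment;
        }
    }

    // The reduced grammar from the question: one implicit literal token and two fragments.
    static List<Rule> questionRules() {
        return Arrays.asList(
            new Rule("'.'", "\\.", false),
            new Rule("NonZeroDigit", "[1-9]", true),
            new Rule("Digit", "[0-9]", true));
    }

    static List<String> tokenize(String input, List<Rule> rules) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestLen = 0;
            for (Rule rule : rules) {
                if (rule.fragment) {
                    continue; // fragments are helpers only; they never yield a token
                }
                Matcher m = rule.body.matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = rule;
                    bestLen = m.end() - pos;
                }
            }
            if (best == null) {
                out.add("token recognition error at: '" + input.charAt(pos) + "'");
                pos++; // skip the offending character and continue
            } else {
                out.add(best.name);
                pos += bestLen;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        tokenize("1.11", questionRules()).forEach(System.out::println);
    }
}
```

In this sketch only the '.' is turned into a token; every '1' is reported as an error because no non-fragment rule can match it.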



Answer 2:

The fragment keyword is not the only problem.


First, reproduce the errors.

Compiling your Test.g4 produces the following warnings:

warning(156): Test.g4:11:21: invalid escape sequence \"
warning(156): Test.g4:123:59: invalid escape sequence \"
warning(146): Test.g4:11:0: non-fragment lexer rule QUOTE can match the empty string
warning(125): Test.g4:3:8: implicit definition of token NonZeroDigit in parser
warning(125): Test.g4:3:25: implicit definition of token Digit in parser


After removing unused rules:

grammar Test;

start : NonZeroDigit '.' Digit Digit? EOF
      ;

fragment
NonZeroDigit : [1-9]
             ;

fragment
Digit : '0' | NonZeroDigit
      ;


Then compile it again and test it:

warning(125): Test.g4:3:8: implicit definition of token NonZeroDigit in parser
warning(125): Test.g4:3:25: implicit definition of token Digit in parser


line 1:0 token recognition error at: '1'
line 1:2 token recognition error at: '1'
line 1:3 token recognition error at: '1'
line 1:1 missing NonZeroDigit at '.'
line 1:4 missing Digit at '<EOF>'
(start <missing NonZeroDigit> . <missing Digit> <EOF>)


When applying 'fragment'

With 'fragment' applied to NonZeroDigit and Digit, the grammar is effectively equivalent to the following.

First, replace NonZeroDigit with [1-9]:

grammar Test;

start : [1-9] '.' Digit Digit? EOF
      ;

fragment
Digit : '0' | [1-9]
      ;


Then replace Digit with ('0' | [1-9]):

grammar Test;

start : [1-9] '.' ('0' | [1-9]) ('0' | [1-9])? EOF
      ;

But start is a parser rule (its name begins with a lowercase letter), and a parser rule cannot contain character sets such as [1-9]; those are only valid in lexer rules.


See The Definitive ANTLR 4 Reference, page 73:

Lexer rule names begin with uppercase letters and parser rule names begin with lowercase letters. For example, ID is a lexical rule name, and expr is a parser rule name.


After removing 'fragment'

After removing 'fragment' from the grammar, an unexpected error still occurs (here with the input 1.03):

line 1:3 extraneous input '3' expecting {<EOF>, Digit}
(start 1 . 0 3 <EOF>)
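
The '3' is rejected because both NonZeroDigit and Digit can match it, and when two lexer rules match the same text ANTLR picks the one defined first. The sketch below mimics only that tie-break in plain Java. It is a simplified illustration, not ANTLR code: TieBreakDemo, rules and tokenTypes are hypothetical names, and each rule is reduced to a single-character test.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

// Simplified illustration: every rule matches exactly one character,
// so the only thing that decides the token type is definition order.
class TieBreakDemo {
    // LinkedHashMap keeps the rules in the order they are "defined".
    static Map<String, IntPredicate> rules() {
        Map<String, IntPredicate> rules = new LinkedHashMap<>();
        rules.put("'.'", c -> c == '.');
        rules.put("NonZeroDigit", c -> c >= '1' && c <= '9');
        rules.put("Digit", c -> c >= '0' && c <= '9');
        return rules;
    }

    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            String type = "<error>";
            for (Map.Entry<String, IntPredicate> rule : rules().entrySet()) {
                if (rule.getValue().test(c)) {
                    type = rule.getKey(); // first matching rule wins
                    break;
                }
            }
            types.add(type);
        }
        return types;
    }

    public static void main(String[] args) {
        // prints [NonZeroDigit, '.', Digit, NonZeroDigit]
        System.out.println(tokenTypes("1.03"));
    }
}
```

The last token comes out as NonZeroDigit, while the parser rule expects an optional Digit there, which matches the "extraneous input '3'" message above.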


Error analysis:
For NonZeroDigit:
If it is renamed to nonZeroDigit (which makes it a parser rule), we get:

syntax error: '1-9' came as a complete surprise to me while matching alternative

This is because [1-9] is a character set (a constant token), which is only allowed in lexer rules. The rule name must therefore start with an uppercase letter (making it a lexer rule).


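For Digit:
It refers to the token NonZeroDigit. Because NonZeroDigit is defined first and also matches 1-9, the lexer never emits a Digit token for those characters, so the rule needs a name that starts with a lowercase letter (making it a parser rule).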


The correct Test.g4 should be:

grammar Test;

start : NonZeroDigit '.' digit digit? EOF
      ;

NonZeroDigit : [1-9]
             ;

digit : '0' | NonZeroDigit
      ;


If you want to keep using fragment, create a lexer rule Number, because the rule consists only of characters (constant tokens). Its name must start with an uppercase letter, which start does not:

grammar Test;

start : Number EOF
      ;

Number : NonZeroDigit '.' Digit Digit?
       ;

fragment
NonZeroDigit : [1-9]
             ;

fragment
Digit : '0' | NonZeroDigit
      ;
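
As a quick sanity check of what Number accepts, its body can be approximated with a java.util.regex pattern. This is only a stand-in for the lexer rule, under the assumption that the regex [1-9]\.[0-9][0-9]? mirrors NonZeroDigit '.' Digit Digit?; the class and method names are hypothetical, and no ANTLR classes are involved.

```java
import java.util.regex.Pattern;

// Regex stand-in for: Number : NonZeroDigit '.' Digit Digit? ;
class NumberRuleDemo {
    static final Pattern NUMBER = Pattern.compile("[1-9]\\.[0-9][0-9]?");

    // matches() requires the whole string to match, similar to "Number EOF" in start.
    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.11", "1.03", "1.5", "0.11", "1.111"}) {
            System.out.println(s + " -> " + isNumber(s));
        }
    }
}
```

Here 1.11, 1.03 and 1.5 are accepted, while 0.11 (leading zero) and 1.111 (three decimals) are not.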