ANTLRWorks - extraneous input

Posted 2019-07-21 07:20

Question:

I am new to this, so I need your help. I am trying to parse the Wikipedia dump, and my first step is to map each of its markup rules into ANTLR. Unfortunately, I hit my first barrier:

line 1:8 extraneous input ''''' expecting '\'\''

I don't understand what is going on. Can you help?

My code:

grammar Test;

options {
    language = Java;
}

parse
    :  term+ EOF
    ;

term 
    :  IDENT
    |  '[[' term ']]'
    |  '\'\'' term '\'\''
    |  '\'\'\'' term '\'\'\''
    ;    

IDENT
    :  ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*
    ;

Input: '''''Hello World'''''

Answer 1:

A lexer rule must always match at least 1 character. Your rule:

IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*;

matches the empty string (and there are infinitely many empty strings in any input). Change the * to a +:

IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;
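
To make the difference between * and + concrete, here is a minimal Java sketch. It uses java.util.regex purely as a stand-in for the lexer (an assumption for illustration; ANTLR does not work on Java regexes), and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Same character set as IDENT, once with * and once with +.
    static final Pattern STAR = Pattern.compile("[a-zA-Z0-9=#\" ]*");
    static final Pattern PLUS = Pattern.compile("[a-zA-Z0-9=#\" ]+");

    // Length of the match anchored at pos, or -1 if the pattern does not match there.
    static int matchLength(Pattern p, String input, int pos) {
        Matcher m = p.matcher(input);
        m.region(pos, input.length());
        return m.lookingAt() ? m.end() - pos : -1;
    }

    public static void main(String[] args) {
        String input = "'''''Hello World'''''";
        System.out.println(matchLength(STAR, input, 0)); // 0: "matches" without consuming anything
        System.out.println(matchLength(PLUS, input, 0)); // -1: no IDENT can start at a quote
        System.out.println(matchLength(PLUS, input, 5)); // 11: "Hello World"
    }
}
```

With *, the rule succeeds at a quote character while consuming nothing, which is exactly the empty token a lexer must never produce. With +, the rule simply does not apply there.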

EDIT

Input: '''''Hello World'''''

Although you put literal tokens inside parser rules ('\'\'\'', '\'\'', etc.), the parser does not decide which tokens get created. The lexer tokenizes the input independently of the parser, following two strict rules:

  1. it tries to match as much as possible
  2. if 2 different lexer rules match the same number of characters, the one defined first takes precedence

Let's give your literal tokens a name:

BRACKET_OPEN  : '[[';
BRACKET_CLOSE : ']]';
Q3            : '\'\'\'';
Q2            : '\'\'';
IDENT         :  ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;

Now, because of rule #1 (match as much as possible), the input '''''Hello World''''' will be tokenized as follows:

  • Q3
  • Q2
  • IDENT
  • Q3 (yes, a Q3!)
  • Q2
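
The token sequence above can be reproduced with a small hand-written tokenizer. This is only a simplified model of the two lexer rules (longest match, then first-defined rule on a tie), not ANTLR's actual implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class MaximalMunchDemo {
    // Rule names in definition order; an earlier rule wins a tie (rule 2).
    static final String[] NAMES = {"BRACKET_OPEN", "BRACKET_CLOSE", "Q3", "Q2", "IDENT"};

    static boolean isIdentChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || c == '=' || c == '#' || c == '"' || c == ' ';
    }

    // Number of characters the given rule matches at pos (0 means no match).
    static int matchLength(int rule, String in, int pos) {
        switch (rule) {
            case 0: return in.startsWith("[[", pos) ? 2 : 0;
            case 1: return in.startsWith("]]", pos) ? 2 : 0;
            case 2: return in.startsWith("'''", pos) ? 3 : 0;
            case 3: return in.startsWith("''", pos) ? 2 : 0;
            default: {
                int end = pos;
                while (end < in.length() && isIdentChar(in.charAt(end))) end++;
                return end - pos;
            }
        }
    }

    static List<String> tokenize(String in) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < in.length()) {
            int best = -1, bestLen = 0;
            for (int rule = 0; rule < NAMES.length; rule++) {
                int len = matchLength(rule, in, pos);
                // Strictly longer only: on a tie the earlier rule keeps precedence.
                if (len > bestLen) { best = rule; bestLen = len; }
            }
            if (best < 0) throw new IllegalArgumentException("no rule matches at index " + pos);
            tokens.add(NAMES[best]);
            pos += bestLen; // rule 1: consume the longest match
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("'''''Hello World'''''"));
        // prints [Q3, Q2, IDENT, Q3, Q2]
    }
}
```

Note that the tokenizer never looks at what the parser would like to see next: after IDENT it greedily takes three quotes again, which is where the second Q3 comes from.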

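But your parser rule term nests its delimiters, so after opening with Q3 Q2 it only accepts the mirrored closing order: Q3 Q2 IDENT Q2 Q3. The lexer produced Q3 Q2 after IDENT instead, so it is correct that your input fails to parse.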
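
The parser side can be sketched the same way. The following is a hypothetical hand-written recursive-descent recognizer for the term rule, working on token names only; it is not the code ANTLR generates, just a minimal model assuming the named tokens Q2, Q3, BRACKET_OPEN, BRACKET_CLOSE and IDENT.

```java
import java.util.List;

class TermParserDemo {
    private final List<String> tokens;
    private int pos = 0;

    TermParserDemo(List<String> tokens) { this.tokens = tokens; }

    // Consume the next token if it has the given type.
    private boolean accept(String type) {
        if (pos < tokens.size() && tokens.get(pos).equals(type)) { pos++; return true; }
        return false;
    }

    // term : IDENT | BRACKET_OPEN term BRACKET_CLOSE | Q2 term Q2 | Q3 term Q3 ;
    private boolean term() {
        if (accept("IDENT")) return true;
        if (accept("BRACKET_OPEN")) return term() && accept("BRACKET_CLOSE");
        if (accept("Q2")) return term() && accept("Q2");
        if (accept("Q3")) return term() && accept("Q3");
        return false;
    }

    // parse : term+ EOF ;
    static boolean parses(List<String> tokens) {
        TermParserDemo p = new TermParserDemo(tokens);
        if (!p.term()) return false;
        while (p.pos < tokens.size()) {
            if (!p.term()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // What the lexer actually produces for '''''Hello World''''':
        System.out.println(parses(List.of("Q3", "Q2", "IDENT", "Q3", "Q2"))); // prints false
        // What the term rule would need instead:
        System.out.println(parses(List.of("Q3", "Q2", "IDENT", "Q2", "Q3"))); // prints true
    }
}
```

The first call fails at the token after IDENT: the inner Q2 term is still open and expects a closing Q2, but the next token is Q3.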

Also, I recommend you not use the interpreter: it's rather buggy. The debugger works like a charm though!