I am new to this, so I will need your help.
I am trying to parse the Wikipedia dump, and my first step is to map each of its markup rules into ANTLR. Unfortunately, I have hit my first barrier:
line 1:8 extraneous input ''''' expecting '\'\''
I don't understand what is going on; please lend me your help.
My code:
grammar Test;
options {
language = Java;
}
parse
: term+ EOF
;
term
: IDENT
| '[[' term ']]'
| '\'\'' term '\'\''
| '\'\'\'' term '\'\'\''
;
IDENT
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*
;
Input:
'''''Hello World'''''
A lexer rule must always match at least 1 character. Your rule:
IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')*;
can match the empty string, and there are infinitely many empty strings at any position in the input, so the lexer can never finish producing tokens. Change the * to a +:
IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;
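To see why the * version is a problem, here is a small analogy using java.util.regex rather than ANTLR itself. The class name EmptyMatchDemo and the helper matchLengthAtStart are made up for this illustration, and the character class is only a regex approximation of your IDENT rule. It is a sketch of the idea, not of how ANTLR's generated lexer works internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: a regex analogy for the IDENT rule, not ANTLR code.
final class EmptyMatchDemo {

    // Same character set as IDENT, once with * and once with +.
    static final Pattern STAR = Pattern.compile("[a-zA-Z0-9=#\" ]*");
    static final Pattern PLUS = Pattern.compile("[a-zA-Z0-9=#\" ]+");

    // Length of the match anchored at the start of the input, or -1 if there is none.
    static int matchLengthAtStart(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.lookingAt() ? matcher.end() : -1;
    }

    public static void main(String[] args) {
        // The * version "succeeds" on a quote by consuming nothing at all.
        System.out.println(matchLengthAtStart(STAR, "'''"));
        // The + version refuses to match, so another rule has to handle the quotes.
        System.out.println(matchLengthAtStart(PLUS, "'''"));
        // Both versions consume "Hello World" when it is actually there.
        System.out.println(matchLengthAtStart(PLUS, "Hello World'''"));
    }
}
```

A rule that can succeed without consuming a character never advances through the input, which is why the + is needed.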
EDIT
Input '''''Hello World'''''
Although you put literal tokens inside parser rules ('\'\'\'', '\'\'', etc.), you must understand that they are not created at the behest of the parser. The lexer follows strict rules to create tokens:
- it tries to match as much as possible
- if 2 different lexer rules match the same number of characters, the one defined first will get precedence
Let's give your literal tokens a name:
BRACKET_OPEN : '[[';
BRACKET_CLOSE : ']]';
Q3 : '\'\'\'';
Q2 : '\'\'';
IDENT : ('a'..'z' | 'A'..'Z' | '0'..'9' | '=' | '#' | '"' | ' ')+;
Now, because of rule #1 (match as much as possible), the input '''''Hello World''''' will be tokenized as follows:
Q3
Q2
IDENT
Q3 (yes, a Q3!)
Q2
But your parser rule term nests its alternatives, so after Q3 Q2 IDENT it expects the closing tokens in mirrored order: Q3 Q2 IDENT Q2 Q3. The lexer produces Q3 Q2 IDENT Q3 Q2 instead, so it is correct that your input fails to parse properly.
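If it helps, the tokenization above can be reproduced with a small hand-written Java sketch. This is not ANTLR-generated code: the class MaximalMunchDemo and its tokenize method are hypothetical, and they model only the two rules listed earlier (longest match wins, and on a tie the rule defined first wins), which is a simplification of what a real ANTLR lexer does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: a simplified model of the lexer rules, not ANTLR code.
final class MaximalMunchDemo {

    // Token names and their literals, in the order they are defined in the grammar.
    private static final String[][] LITERALS = {
        {"BRACKET_OPEN", "[["},
        {"BRACKET_CLOSE", "]]"},
        {"Q3", "'''"},
        {"Q2", "''"},
    };

    // Characters accepted by the IDENT rule.
    private static boolean isIdentChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || c == '=' || c == '#' || c == '"' || c == ' ';
    }

    // Returns the token names produced for the input.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLength = 0;
            // Strictly-greater comparison: on equal length the earlier rule keeps precedence.
            for (String[] literal : LITERALS) {
                if (input.startsWith(literal[1], pos) && literal[1].length() > bestLength) {
                    bestName = literal[0];
                    bestLength = literal[1].length();
                }
            }
            // IDENT is defined last and needs at least one character (the + version).
            int identEnd = pos;
            while (identEnd < input.length() && isIdentChar(input.charAt(identEnd))) {
                identEnd++;
            }
            if (identEnd - pos > bestLength) {
                bestName = "IDENT";
                bestLength = identEnd - pos;
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestName);
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Q3, Q2, IDENT, Q3, Q2]
        System.out.println(tokenize("'''''Hello World'''''"));
    }
}
```

Note that the closing run of five quotes is split as Q3 then Q2, exactly like the opening run, because the lexer has no idea what the parser would like to see at that point.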
Also, I recommend you not use the interpreter: it's rather buggy. The debugger works like a charm though!