Why doesn't ANTLR4 match “of” as a word and “,” as punctuation?

Posted 2019-03-04 16:06

Question:

I have a Hello.g4 grammar file with a grammar definition:

definition : wordsWithPunctuation ;
words : (WORD)+ ;
wordsWithPunctuation : word ( word | punctuation word | word punctuation | '(' wordsWithPunctuation ')' | '"' wordsWithPunctuation '"' )*  ;
NUMBER : [0-9]+ ;
word : WORD ;
WORD : [A-Za-z-]+ ;
punctuation : PUNCTUATION ;
PUNCTUATION : (','|'!'|'?'|'\''|':'|'.') ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

Now, if I try to build a parse tree from the following input:

a b c d of at of abc bcd of
a b c d at abc, bcd
a b c d of at of abc, bcd of

it reports an error:

Hello::definition:1:31: extraneous input 'of' expecting {<EOF>, '(', '"', WORD, PUNCTUATION}

whereas this input:

a b c d  at:  abc bcd!

works correctly.

What is wrong: the grammar, the input, or the interpreter?

If I modify the wordsWithPunctuation rule by adding (... | 'of' | ',' word | ...), it matches the input completely, but that looks suspicious to me: how is the word of different from the word a or abc? And why is , different from the other punctuation characters (i.e., why does the grammar match : or ! but not ,)?

Update1:

I am working with the ANTLR4 plugin for Eclipse, so the project is built with the following output:

ANTLR Tool v4.2.2 (/var/folders/.../antlr-4.2.2-complete.jar)
Hello.g4 -o /Users/.../eclipse_workspace/antlr_test_project/target/generated-sources/antlr4 -listener -no-visitor -encoding UTF-8

Update2:

The grammar presented above is only a part of the full grammar:

grammar Hello;

text : (entry)+ ;

entry : blub 'abrr' '-' ('1')? '.' ('(' NUMBER ')')? sims '-' '(' definitionAndExamples ')' 'Hello' 'all' 'the' 'people' 'of' 'the' 'world';

blub : WORD ;

sims : sim (',' sim)* ;
sim : words ;

definitionAndExamples : definitions (';' examples)? ;

definitions : definition (';' definition )* ;
definition : wordsWithPunctuation ;

examples : example (';' example )* ;
example : '"' wordsWithPunctuation '"' ;

words : (WORD)+ ;
wordsWithPunctuation : word ( word | punctuation word | word punctuation | '(' wordsWithPunctuation ')' | '"' wordsWithPunctuation '"' )*  ;

NUMBER : [0-9]+ ;
word : WORD ;
WORD : [A-Za-z-]+ ;
punctuation : PUNCTUATION ;
PUNCTUATION : (','|'!'|'?'|'\''|':'|'.') ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

It now looks to me as though the words in the entry rule are somehow breaking the other rules used within the entry rule. But why? Is this a kind of anti-pattern in the grammar?

Answer 1:

By including 'of' in a parser rule, you make ANTLR create an implicit anonymous token to represent that input. The word of will always have that special token type, so it will never have the type WORD. The only place it may appear in your parse tree is at a location where 'of' appears in a parser rule. The same applies to ',': the literal in the sims rule gets its own implicit token type, so a comma is never tokenized as PUNCTUATION.
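
To make the effect concrete, here is a rough sketch of how the lexer of the combined grammar behaves. ANTLR gives each literal from the parser rules its own token type (named T__0, T__1, and so on in the generated code; the numbers below are illustrative and not taken from a real build), and these implicit tokens win over WORD and PUNCTUATION whenever both match the same text:

```antlr
// Sketch only: the combined grammar behaves as if rules like these were
// placed ahead of the explicit lexer rules. The numbering is illustrative.
T__0 : 'of' ;   // literal from the entry rule
T__1 : ',' ;    // literal from the sims rule
T__2 : '.' ;    // literal from the entry rule
// further implicit tokens: 'abrr', 'Hello', 'all', 'the', 'people', 'world',
// '-', '1', '(', ')', ';', '"'

NUMBER      : [0-9]+ ;
WORD        : [A-Za-z-]+ ;                    // never produced for the input "of"
PUNCTUATION : (','|'!'|'?'|'\''|':'|'.') ;    // never produced for "," or "."
WS          : [ \t\r\n]+ -> skip ;
```

Seen this way, a, abc, : and ! still lex as WORD or PUNCTUATION because no parser rule mentions them as literals, while of and , get their own token types.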

You can prevent ANTLR from creating these anonymous token types by splitting your grammar into a lexer grammar HelloLexer in HelloLexer.g4 and a parser grammar HelloParser in HelloParser.g4. I highly recommend you always use this form for the following reasons:

  1. Lexer modes only work if you do this.
  2. Implicitly-defined tokens are one of the most common sources of bugs in a grammar, and separating the grammar prevents it from ever happening.

Once you have the grammar separated, you can update your word parser rule to allow the special token of to be treated as a word.

word
  : WORD
  | 'of'   // needs a matching lexer rule, e.g. OF : 'of' ; placed before WORD
  // | ... other keywords which are also "words"
  ;
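
As a minimal sketch of the separated form, assuming only the rules from the shortened grammar in the question, the two files could look roughly like this. The token names OF, COMMA, LPAREN, RPAREN and QUOTE are made up for illustration; any names work as long as the keyword rules appear before WORD.

```antlr
// HelloLexer.g4 (sketch; OF, COMMA, LPAREN, RPAREN, QUOTE are assumed names)
lexer grammar HelloLexer;

// Keywords come before WORD: when two rules match the same text, the first wins.
OF          : 'of' ;
COMMA       : ',' ;
LPAREN      : '(' ;
RPAREN      : ')' ;
QUOTE       : '"' ;

NUMBER      : [0-9]+ ;
WORD        : [A-Za-z-]+ ;
PUNCTUATION : '!' | '?' | '\'' | ':' | '.' ;
WS          : [ \t\r\n]+ -> skip ;   // skip spaces, tabs, newlines
```

```antlr
// HelloParser.g4 (sketch covering only the rules from the shortened grammar)
parser grammar HelloParser;
options { tokenVocab = HelloLexer; }

definition : wordsWithPunctuation ;

wordsWithPunctuation
  : word ( word
         | punctuation word
         | word punctuation
         | LPAREN wordsWithPunctuation RPAREN
         | QUOTE wordsWithPunctuation QUOTE
         )*
  ;

word        : WORD | OF ;             // "of" has its own token type but is still a word here
punctuation : PUNCTUATION | COMMA ;   // same idea for the comma
```

In the full grammar, the other literals of the entry rule ('abrr', 'Hello', 'all', 'the', 'people', 'world', '.', and so on) would need the same treatment: an explicit lexer rule each, plus an alternative in word or punctuation wherever they should also be accepted as ordinary text.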