ANTLR4: Invoke different sub-parser for specific r

Posted 2019-08-28 23:26

Question:

Consider this very simplified example, where input of the following form should be matched:

mykey -> This is the value

My real case is much more complex, but this is enough to show what I am trying to achieve. mykey is an ID, while on the right side of -> we have a sequence of Words. If I use

grammar Root;

parse
    : ID '->' value
    ;

value
    : Word+
    ;

ID
    : ('a'..'z')+
    ;


Word
    : ('a'..'z' | 'A'..'Z' | '0'..'9')+
    ;

WS
    : ' ' -> skip
    ;

the example won't be parsed, because the lexer emits an ID token for the word "is" (and for every other all-lowercase word), and an ID token is not matched by Word+. In my real example, the value language is vastly different, and I'd like to parse it with a different grammar.
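To make the tokenization problem concrete, here is a minimal sketch in plain Python (no ANTLR runtime) that imitates how the lexer picks a rule: the longest match wins, and ties go to the rule defined first. The names RULES and tokenize, and the token name ARROW for the '->' literal, are made up for this illustration.

```python
import re

# Lexer rules in grammar order. As in ANTLR, the longest match wins and
# ties go to the rule listed first.
RULES = [
    ("ARROW", re.compile(r"->")),           # the '->' literal from the parser rule
    ("ID", re.compile(r"[a-z]+")),
    ("Word", re.compile(r"[a-zA-Z0-9]+")),
    ("WS", re.compile(r" ")),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earliest rule keeps ties.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # WS -> skip
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

tokens = tokenize("mykey -> This is the value")
for token in tokens:
    print(token)
```

Only "This" becomes a Word, because ID cannot match the capital T. Every lowercase word ties between ID and Word and goes to ID, which is why value : Word+ fails.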

I have considered different solutions:

  1. Switching the lexer mode. AFAIK, a mode switch can only be triggered from a lexer rule. That is a problem both here and in my real case, because no unique tokens mark the start and end of the value part. What I would need is something like "tokenize value with different rules", which of course makes no sense: the lexer and parser act independently, and by the time the parser starts, everything is already tokenized.

  2. Using a different grammar for value. If I understand correctly, importing a grammar won't work, since an import always merges the two grammars, which leads to the same wrong tokenization.

  3. Creating a crude first-pass parser that accepts the whole language but doesn't build the correct tree for value. I could then use a visitor to reparse the value nodes with a different sub-parser, possibly inserting a new, correct subtree for value. This feels a bit clumsy.
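The third option can be sketched roughly as follows, in plain Python rather than with ANTLR-generated classes. Everything here is hypothetical: parse_outer stands in for the crude first-pass parser, parse_value for the sub-parser of the injected language, and reparse for the visitor that swaps in the new subtree.

```python
import re

def parse_outer(line):
    """Crude first pass: split 'key -> raw value' without tokenizing the value."""
    m = re.fullmatch(r"\s*([a-z]+)\s*->(.*)", line)
    if m is None:
        raise ValueError(f"not a 'key -> value' line: {line!r}")
    return {"type": "parse", "key": m.group(1), "raw_value": m.group(2)}

def parse_value(raw):
    """Stand-in sub-parser for the injected language: here, just a list of words."""
    return {"type": "value", "words": re.findall(r"[a-zA-Z0-9]+", raw)}

def reparse(tree):
    """Second pass: replace the raw value text with the sub-parser's subtree."""
    return {
        "type": tree["type"],
        "key": tree["key"],
        "value": parse_value(tree["raw_value"]),
    }

tree = reparse(parse_outer("mykey -> This is the value"))
print(tree)
```

The outer pass never looks inside the value, so the two languages cannot interfere with each other's tokenization. The cost is a second pass and the bookkeeping to map positions in the subtree back to the original input.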

If you need a simple real-world application, consider strings in Java. Some of them might contain a regex, which needs to be parsed with a completely different parser. This is similar to the injected languages you can use inside IDEA.

Question: Is there an idiomatic way in ANTLR4 to parse a specific rule with a different grammar? Ideally I could specify this at the grammar level, so that the resulting AST is a combination of the outer language and a subtree of the injected language.

Answer 1:

Actually, using modes is the idiomatic solution. It just requires a bit of creativity in identifying the mode guards:

parser grammar RootParser ;

options {
    tokenVocab = RootLexer ;
}

parse   : ID RARROW value EOF ;
value   : WORD+ ;

and

lexer grammar RootLexer ;

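ID      : [a-z]+      ;
RARROW  : '->' -> pushMode(value) ;
SPACE   : [ \t]+ -> skip ;   // skip blanks before '->'; otherwise the default mode reports a token recognition error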

mode value ;
    EOL     : [\r\n]+ -> popMode, skip ;
    WORD    : [a-zA-Z0-9]+  ;
    WS      : ' ' -> skip   ;
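
To see how the mode guards steer tokenization, here is a small simulation in plain Python (not the ANTLR runtime) of a lexer with a mode stack. MODES and tokenize are illustrative names, and the default-mode SPACE rule is an assumption of this sketch, needed so the blank before -> is skipped.

```python
import re

# mode name -> rules as (token name, pattern, lexer commands)
MODES = {
    "DEFAULT": [
        ("ID", re.compile(r"[a-z]+"), ()),
        ("RARROW", re.compile(r"->"), ("push:value",)),
        ("SPACE", re.compile(r"[ \t]+"), ("skip",)),  # skips the blank before '->'
    ],
    "value": [
        ("EOL", re.compile(r"[\r\n]+"), ("pop", "skip")),
        ("WORD", re.compile(r"[a-zA-Z0-9]+"), ()),
        ("WS", re.compile(r" "), ("skip",)),
    ],
}

def tokenize(text):
    tokens, pos, stack = [], 0, ["DEFAULT"]
    while pos < len(text):
        best = None
        # Only the rules of the mode on top of the stack are candidates.
        for name, pattern, commands in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end(), commands)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in mode {stack[-1]}")
        name, end, commands = best
        if "skip" not in commands:
            tokens.append((name, text[pos:end]))
        for command in commands:
            if command.startswith("push:"):
                stack.append(command[len("push:"):])
            elif command == "pop":
                stack.pop()
        pos = end
    return tokens

tokens = tokenize("mykey -> This is the value\nnext")
for token in tokens:
    print(token)
```

After RARROW pushes the value mode, the ID rule is no longer a candidate, so every word on the right-hand side becomes a WORD. The end of the line pops back to the default mode, where "next" is an ID again.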


Answer 2:

You can try to move the decision about what counts as a word into the parser:

grammar Root;

parse
  : ID '->' value
  ;

value
  : word+
  ;

word : Word | ID;

//the same lexer rules as above

This will parse the value as follows (input text -> token type -> parser rule):

This  -> Word -> word
is    -> ID   -> word
the   -> ID   -> word
value -> ID   -> word

So at the level of the parser nodes you will have only words.
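
As a rough illustration of that idea, the following plain-Python sketch hand-codes the three parser rules over a token list. The token list is written out by hand as the lexer would emit it, ARROW is a made-up name for the '->' literal, and parse is a hypothetical helper rather than ANTLR-generated code.

```python
def parse(tokens):
    """parse : ID '->' value ;   value : word+ ;   word : Word | ID ;"""
    if len(tokens) < 3 or tokens[0][0] != "ID" or tokens[1][0] != "ARROW":
        raise SyntaxError("expected: ID '->' followed by at least one word")
    words = []
    for kind, text in tokens[2:]:
        # word accepts either token type, so the lexer's ID/Word choice
        # no longer matters at the parser level.
        if kind not in ("Word", "ID"):
            raise SyntaxError(f"unexpected token {kind} {text!r}")
        words.append(("word", text))
    return ("parse", tokens[0][1], ("value", words))

# Token stream as the lexer would emit it for "mykey -> This is the value".
tokens = [
    ("ID", "mykey"), ("ARROW", "->"),
    ("Word", "This"), ("ID", "is"), ("ID", "the"), ("ID", "value"),
]
tree = parse(tokens)
print(tree)
```

This only works when every token type the lexer can produce inside the value is acceptable as a word. For a value language that differs a lot from the outer one, the list of alternatives in word grows quickly.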