Switching to island mode on multi-character token

Posted 2019-06-07 15:22

Question:

I am working on a grammar that is basically an island grammar.

Let's say the "island" is everything between braces, and the "sea" is everything else. Like this:

{ (island content) }

Then this simple grammar works:

IslandStart
:
    '{' -> pushMode(Island)
;

Fluff
:
    ~[\{\}]+
;

....

But I'm having trouble coming up with a similar solution for the case where I want a complex (multi-character) opening for my "island" block, like this:

{# (island content) }

In this case I don't know how to make a rule for "Fluff" (everything but my opening sequence).

IslandStart
    :
        '{#' -> pushMode(Island)
    ;

Fluff
    :
        ~[\{\}]+ /* Should now include opening braces as well
                    if they are not immediately followed by a # sign */
    ;

How do I make it work?


EDIT: GRosenberg came up with a solution but I get a lot of tokens (one per character) with it. This is an example to demonstrate this behaviour:

My lexer grammar:

lexer grammar Demolex;

IslandStart
    :
        '{$' -> pushMode(Island)
    ;


Fluff
    : 
          '{' ~'$' .* // any 2+ char seq that starts with '{', but not '{$'
        | '{' '$$' .* // starts with hypothetical not IslandStart marker
        | '{'         // just the 1 char 
        | .*? ~'{'    // minimum sequence that ends before an '{'
    ;

mode Island;

IslandEnd
    :
        '}' -> popMode
    ;

Simplest parser grammar:

grammar Demo;
options { tokenVocab = Demolex; }

template
    :
        Fluff+
    ;

This generates a tree with a lot of tokens (one per character) from the input "somanytokens" when I debug it in the ANTLR4 plugin for Eclipse.

It's not likely that it's a plugin problem. I can easily come up with a token definition that will result in one big fat token in the tree.

Actually, even the simplest form of grammar gives this result:

grammar Demo2;

template4
    :
        Fluff+
    ;

Fluff
    : 
         .*? ~'{'    // minimum sequence that ends before an '{'
    ;

Answer 1:

You just need to specify the complement of the opening sequence explicitly:

IslandStart : '{#' -> pushMode(Island) ;

Fluff       : '{' ~'#' .* // any 2+ char seq that starts with '{', but not '{#'
            | '{' '##' .* // starts with hypothetical not IslandStart marker
            | '{'         // just the 1 char 
            | .*? ~'{'    // minimum sequence that ends before an '{'
            ;

Fluff alt2 wins when it is the longer match relative to IslandStart, since the lexer prefers the longest match. Fluff alt3 applies only when IslandStart and Fluff alt1 do not match a character sequence starting with '{'. Fluff alt4 is the catchall for content up to but not including a '{', which lets the lexer consider sequences aligned on a '{'.
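The interplay between IslandStart and the Fluff alternatives can be sketched outside ANTLR. The following is a rough Python emulation, not ANTLR itself: each rule is approximated by a regular expression, the longest match at the current position wins, and ties go to the rule listed first (ANTLR's documented tie-break). The name `winning_rule` and the regex translations are assumptions made for illustration.

```python
import re

# Regex approximations of the lexer rules above (an assumption of this
# sketch; ANTLR's lexer is not a regex engine). ANTLR's '.' also matches
# newlines, hence re.DOTALL below.
RULES = [
    ("IslandStart", [r"\{#"]),
    ("Fluff", [
        r"\{[^#].*",   # alt1: '{' ~'#' .*
        r"\{##.*",     # alt2: '{' '##' .*
        r"\{",         # alt3: '{'
        r".*?[^{]",    # alt4: .*? ~'{'
    ]),
]

def winning_rule(text):
    """Return (rule_name, matched_text) for the longest match at offset 0.

    The strict '>' comparison keeps the earlier rule on a tie.
    Returns None if no rule matches.
    """
    best = None
    for name, alternatives in RULES:
        for pattern in alternatives:
            m = re.match(pattern, text, re.DOTALL)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
    return best

print(winning_rule("{#island}"))  # ('IslandStart', '{#')
print(winning_rule("{##x}"))      # ('Fluff', '{##x}')
print(winning_rule("{x}"))        # ('Fluff', '{x}')
```

In this emulation, `winning_rule("{x}{#island}")` returns the entire string as Fluff, because the greedy `.*` in alt1 runs to the end of the input. That suggests the `.*` tails would also swallow later islands in a real lexer.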

Update

Let's make it a more reasonably complete example grammar:

parser grammar TestParser;

options{
    tokenVocab=TestLexer;
}

template : ( Fluff | Stuff )+ EOF ;

and

lexer grammar TestLexer;

IslandStart : '{' '$' -> pushMode(Island),more ;

Fluff : '{' ~'$' ~'{'*? '}'     // any brace-delimited seq that starts with '{', but not '{$'
      | '{' '$' '$' ~'{'*? '}'  // or starts with hypothetical not IslandStart marker
      | '{' '}'                 // just the empty pair
      | ~'{'+                   // longest run of characters that contains no '{'
      ;

mode Island;

Stuff : '}' -> popMode ;
Char  : .   -> more    ;

with input so{$Island}many{}tokens{$$notIsland}and{inner}end

Token dump:

Fluff: [@0,0:1='so',<1>,1:0]
Stuff: [@1,2:10='{$Island}',<2>,1:2]
Fluff: [@2,11:14='many',<1>,1:11]
Fluff: [@3,15:16='{}',<1>,1:15]
Fluff: [@4,17:22='tokens',<1>,1:17]
Fluff: [@5,23:35='{$$notIsland}',<1>,1:23]
Fluff: [@6,36:38='and',<1>,1:36]
Fluff: [@7,39:45='{inner}',<1>,1:39]
Fluff: [@8,46:48='end',<1>,1:46]

Parse tree:

(template so {$Island} many {} tokens {$$notIsland} and {inner} end <EOF>)
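The token dump above can be reproduced with a small hand-written emulation of TestLexer. This is a sketch under stated assumptions, not generated ANTLR code: the rules are translated to Python regular expressions, the longest match wins with ties going to the earlier rule, and the Island mode (`Char -> more` until `Stuff` pops the mode) is collapsed into a search for the next '}'. The function name `tokenize` is hypothetical.

```python
import re

# Default-mode rules of TestLexer, approximated as regexes (assumption).
DEFAULT_RULES = [
    ("IslandStart", re.compile(r"\{\$")),
    ("Fluff", re.compile(r"\{[^$][^{]*?\}")),   # '{' ~'$' ~'{'*? '}'
    ("Fluff", re.compile(r"\{\$\$[^{]*?\}")),   # '{' '$' '$' ~'{'*? '}'
    ("Fluff", re.compile(r"\{\}")),             # '{' '}'
    ("Fluff", re.compile(r"[^{]+")),            # ~'{'+
]

def tokenize(text):
    """Return a list of (token_type, token_text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in DEFAULT_RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name == "IslandStart":
            # pushMode(Island) plus 'more': the text keeps accumulating
            # character by character until '}' produces a Stuff token.
            close = text.find("}", best_end)
            if close == -1:
                raise ValueError("unterminated island")
            best_name, best_end = "Stuff", close + 1
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

for token in tokenize("so{$Island}many{}tokens{$$notIsland}and{inner}end"):
    print(token)
```

Note how `{$$notIsland}` becomes Fluff: IslandStart matches only `{$`, while the second Fluff alternative matches all 13 characters, so the longer match wins.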

The lexer rules operate the same way as before. The changes accommodate the closing '}' terminals. Alt4, as simplified, works as originally intended. I'm not entirely sure why it was a problem for ANTLR to begin with, but simpler is better in any case.
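A plausible explanation for the one-token-per-character behaviour reported in the question is that a non-greedy subrule matches as few characters as it can while still letting the rule succeed. In `.*? ~'{'` the `.*?` part can match nothing at all, so every token ends after a single non-'{' character. The snippet below shows the same effect with Python's `re` module. It is only an analogy, assuming regex laziness behaves like ANTLR's non-greedy loop in this simple case.

```python
import re

text = "somanytokens"

# Analogue of  .*? ~'{'  : the lazy prefix matches nothing, so each
# match is exactly one non-'{' character.
lazy_tokens = re.findall(r".*?[^{]", text, flags=re.DOTALL)

# Analogue of  ~'{'+  : a single greedy run of non-'{' characters.
greedy_tokens = re.findall(r"[^{]+", text)

print(len(lazy_tokens), lazy_tokens[:3])  # 12 ['s', 'o', 'm']
print(greedy_tokens)                      # ['somanytokens']
```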



Tags: antlr antlr4