Antlr get Sub-Tokens

Posted 2019-08-30 03:22

Question:

Forgive me if my terminology is off.

Let's say I have this bit of simplified grammar:

// parser
expr : COMPARATIVE;

// lexer
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+;
COMPARATOR
    : 'vs'
    | 'versus'
    ;
ITEM
    : 'boy'
    | 'girl'
    ;
COMPARATIVE : ITEM WS* COMPARATOR WS* ITEM;

So this will of course match 'boy vs girl' or 'girl vs boy', etc. My question is about what happens when I create a lexer, i.e.

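// requires: import org.antlr.v4.runtime.*;
CharStream stream = new ANTLRInputStream("boy vs girl");
SearchLexer lex = new SearchLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lex);
tokens.fill();  // read every token up to EOF
// CommonTokenStream is not Iterable, so loop over its token list
for (Token token : tokens.getTokens()) {
    System.out.print(token.getType() + " [" + token.getText() + "] ");
}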

This prints out something like this: 9 [boy vs girl], i.e. it matches my query fine, but now I want to be able to get the sub-tokens of that single token.

My intuition tells me I need to use trees, but I really don't know how to do this in ANTLR 4 for my example. Thanks.

Answer 1:

Currently, COMPARATIVE is a lexer rule, which means the lexer will try to make a single token from all the text that matches it. You should instead make it a parser rule, comparative, and have expr refer to that rule rather than to COMPARATIVE:

comparative : ITEM WS* COMPARATOR WS* ITEM;

Since COMPARATIVE will no longer be considered a single token, you'll instead get individual tokens for ITEM, WS, and COMPARATOR.
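
To make the resulting token split concrete, here is a rough stand-in written with java.util.regex rather than the ANTLR-generated lexer. The class name TokenSplitSketch and the tokenize helper are hypothetical and exist only for this illustration; the three alternatives mirror the ITEM, COMPARATOR and WS lexer rules from the question. It is only meant to show roughly what the token sequence for "boy vs girl" looks like once COMPARATIVE is no longer a lexer rule, not how ANTLR itself tokenizes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: a regex imitation of the three lexer
// rules, not the ANTLR-generated SearchLexer.
class TokenSplitSketch {
    private static final String[] RULES = {"ITEM", "COMPARATOR", "WS"};

    // One named group per lexer rule (WS covers tab, space, CR, LF, form feed).
    private static final Pattern TOKEN = Pattern.compile(
        "(?<ITEM>boy|girl)|(?<COMPARATOR>versus|vs)|(?<WS>[ \\t\\r\\n\\f]+)");

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Every token must start exactly where the previous one ended.
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException("unexpected input at index " + pos);
            }
            for (String rule : RULES) {
                if (m.group(rule) != null) {
                    out.add(rule + " [" + m.group(rule) + "]");
                }
            }
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ITEM [boy], WS [ ], COMPARATOR [vs], WS [ ], ITEM [girl]]
        System.out.println(tokenize("boy vs girl"));
    }
}
```

With the real generated lexer you would see the same five pieces (followed by an EOF token), printed with their numeric token types instead of rule names.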

Two side notes:

  1. If whitespace is not significant, you can hide it from the parser rules like this:

    WS : ('\t' | ' ' | '\r' | '\n'| '\u000C')+ -> channel(HIDDEN);
    

    You can then simplify the comparative parser rule to:

    comparative : ITEM COMPARATOR ITEM;
    
  2. In ANTLR 4, you can simplify character sets using a new syntax:

    WS : [ \t\r\n\u000C]+ -> channel(HIDDEN);
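
Putting the pieces above together, the whole grammar might look roughly like the sketch below. The grammar name Search is an assumption inferred from the SearchLexer class in the question, and expr is changed to refer to the new parser rule, which the original snippet does not show.

```antlr
// Assumption: the grammar is named Search (the question uses SearchLexer).
grammar Search;

// parser
expr        : comparative;              // was: expr : COMPARATIVE;
comparative : ITEM COMPARATOR ITEM;     // WS is hidden, so it is not listed

// lexer
COMPARATOR  : 'vs' | 'versus';
ITEM        : 'boy' | 'girl';
WS          : [ \t\r\n\u000C]+ -> channel(HIDDEN);
```

Because WS goes to the hidden channel rather than being skipped, the whitespace tokens still exist in the token stream; the parser just does not see them.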
    


Tags: java antlr4