Getting plain text in antlr instead of tokens

Posted 2019-08-01 08:10

Question:

I'm trying to create a parser using antlr. My grammar is as follows.

code : codeBlock* EOF;

codeBlock
: text
| tag1Ops
| tag2Ops
;

tag1Ops: START_1_TAG ID END_1_TAG ;

tag2Ops: START_2_TAG ID END_2_TAG ;

text: ~(START_1_TAG|START_2_TAG)+;

START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;

ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;

WS :  ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);

SPACES: SPACE+;

ANY_CHAR : .;

fragment SPACE : ' ' | '\r' | '\n' | '\t' ;

Along with various tags, I also need a rule that captures text that is not inside any of the tags. Things seem to work with the current grammar, but since the input for the 'text' rule goes through the lexer first, any text entered is tokenized and I get a list of tokens instead of a single string token. The ANTLR profiler in IntelliJ also shows ambiguous calls for each token.

For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.

I think I might be looking at this from the wrong angle, and would like to know if there is any other way to handle the 'text' rule.

Answer 1:

First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar you have a SPACES rule. Because SPACES is defined after WS and matches exactly the same input, the lexer always picks WS (when two rules match the same number of characters, the one defined first wins), so the SPACES rule will never be matched.
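The tie-break can be illustrated without ANTLR. The following is only a rough sketch in plain Java: the class name `TieBreakSketch` and the method `winningRule` are made up for this illustration, and the regexes are hand translations of the WS, SPACES and ANY_CHAR rules from the question. It applies "longest match wins, and on a tie the earlier rule wins" to the start of a string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: mimics "longest match, then first-defined rule".
// TieBreakSketch and winningRule are hypothetical names, not ANTLR API.
class TieBreakSketch {

  // LinkedHashMap keeps insertion order, standing in for grammar order.
  static final Map<String, Pattern> RULES = new LinkedHashMap<>();
  static {
    RULES.put("WS", Pattern.compile("[ \\n\\r\\t]+"));
    RULES.put("SPACES", Pattern.compile("[ \\r\\n\\t]+"));
    RULES.put("ANY_CHAR", Pattern.compile(".", Pattern.DOTALL));
  }

  // Returns the name of the rule that would produce the first token,
  // or null if no rule matches at the start of the input.
  static String winningRule(String input) {
    String best = null;
    int bestLength = 0;
    for (Map.Entry<String, Pattern> entry : RULES.entrySet()) {
      Matcher m = entry.getValue().matcher(input);
      // Strictly greater: an equally long later match never replaces
      // an earlier one, so SPACES can never beat WS.
      if (m.lookingAt() && m.end() > bestLength) {
        best = entry.getKey();
        bestLength = m.end();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(winningRule("  \t x")); // whitespace run
    System.out.println(winningRule("?"));      // single other char
  }
}
```

In this sketch SPACES can only ever tie with WS, never beat it, which is why removing SPACES (or WS) from the grammar changes nothing about the tokens produced.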

Quoting the question: "For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar."

You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:

// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;

START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT        : ( ~[<] | '<' ~[<%] )+;

mode IN_TAG;
  ID         : [A-Za-z_][A-Za-z0-9_]*;
  INT_NUMBER : [0-9]+;
  END_1_TAG  : '%>' -> popMode;
  END_2_TAG  : '>>' -> popMode;
  SPACE      : [ \t\r\n] -> channel(HIDDEN);

To test this lexer grammar, run this class:

import org.antlr.v4.runtime.*;

public class Main {

  public static void main(String[] args) {

    String source = "<%FOO%>FOO BAR<<123>>456 mu!";
    DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
    CommonTokenStream tokenStream = new CommonTokenStream(lexer);
    tokenStream.fill();

    for (Token t : tokenStream.getTokens()) {
      System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
    }
  }
}

which will print:

START_1_TAG          <%
ID                   FOO
END_1_TAG            %>
TEXT                 FOO BAR
START_2_TAG          <<
INT_NUMBER           123
END_2_TAG            >>
TEXT                 456 mu!
EOF                  <EOF>
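To make the effect of pushMode and popMode more concrete, here is a hand-written sketch in plain Java that needs no ANTLR runtime. It is not how ANTLR works internally. The names `ModeSketch`, `Rule` and `tokenize` are invented for this illustration, the regexes are hand translations of the lexer rules, hidden-channel tokens are simply dropped, and no EOF token is produced.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written illustration of lexical modes. This is NOT how ANTLR is
// implemented; ModeSketch, Rule and tokenize are hypothetical names.
class ModeSketch {

  // One lexer rule: token name, Java regex, and a mode action
  // ("push", "pop", "hide" or "none").
  static final class Rule {
    final String name;
    final Pattern pattern;
    final String action;

    Rule(String name, String regex, String action) {
      this.name = name;
      this.pattern = Pattern.compile(regex);
      this.action = action;
    }
  }

  // Rules that are active outside of tags, in grammar order.
  static final List<Rule> DEFAULT_MODE = Arrays.asList(
      new Rule("START_1_TAG", "<%", "push"),
      new Rule("START_2_TAG", "<<", "push"),
      new Rule("TEXT", "(?:[^<]|<[^<%])+", "none"));

  // Rules that are only active between a start tag and an end tag.
  static final List<Rule> IN_TAG = Arrays.asList(
      new Rule("ID", "[A-Za-z_][A-Za-z0-9_]*", "none"),
      new Rule("INT_NUMBER", "[0-9]+", "none"),
      new Rule("END_1_TAG", "%>", "pop"),
      new Rule("END_2_TAG", ">>", "pop"),
      new Rule("SPACE", "[ \\t\\r\\n]", "hide"));

  static List<String> tokenize(String input) {
    List<String> tokens = new ArrayList<>();
    Deque<List<Rule>> modeStack = new ArrayDeque<>();
    modeStack.push(DEFAULT_MODE);
    int pos = 0;
    while (pos < input.length()) {
      Rule best = null;
      int bestEnd = pos;
      // Only the rules of the mode on top of the stack are tried.
      for (Rule rule : modeStack.peek()) {
        Matcher m = rule.pattern.matcher(input).region(pos, input.length());
        if (m.lookingAt() && m.end() > bestEnd) { // longest match, first rule wins ties
          best = rule;
          bestEnd = m.end();
        }
      }
      if (best == null) {
        throw new IllegalStateException("No rule matches at index " + pos);
      }
      if (!best.action.equals("hide")) { // simplification: hidden tokens are dropped
        tokens.add(best.name + " " + input.substring(pos, bestEnd));
      }
      if (best.action.equals("push")) modeStack.push(IN_TAG);
      if (best.action.equals("pop")) modeStack.pop();
      pos = bestEnd;
    }
    return tokens;
  }

  public static void main(String[] args) {
    tokenize("<%FOO%>FOO BAR<<123>>456 mu!").forEach(System.out::println);
  }
}
```

The point of the sketch is that TEXT is only a candidate while the default mode is on top of the stack, and ID and INT_NUMBER only inside a tag. That is why FOO becomes an ID inside <%...%> but part of a TEXT token outside of it.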

Use your lexer grammar in a separate parser grammar like this:

// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;

options {
  tokenVocab=DemoLexer;
}

code
 : codeBlock* EOF
 ;

...

EDIT

From the comments: "[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?"

A breakdown of ( ~[<] | '<' ~[<%] )+:

(            # start group
  ~[<]       #   match any char other than '<'
  |          #   OR
  '<' ~[<%]  #   match a '<' followed by any char other than '<' and '%'
)+           # end group, and repeat it once or more
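If it helps to experiment with the rule outside of ANTLR, the same pattern can be approximated with a Java regex. This is only a sketch: `TextRuleSketch` and `leadingText` are names made up here, and `(?:[^<]|<[^<%])+` is a hand translation of the TEXT rule, not something ANTLR generates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a Java regex that mirrors the TEXT lexer rule.
// TextRuleSketch and leadingText are hypothetical names.
class TextRuleSketch {

  // [^<]     any char other than '<'
  // <[^<%]   a '<' followed by any char other than '<' and '%'
  static final Pattern TEXT = Pattern.compile("(?:[^<]|<[^<%])+");

  // Returns the part of the input that TEXT would consume from the
  // start, or an empty string if TEXT does not match there at all.
  static String leadingText(String input) {
    Matcher m = TEXT.matcher(input);
    return m.lookingAt() ? m.group() : "";
  }

  public static void main(String[] args) {
    System.out.println("[" + leadingText("FOO BAR<<123>>") + "]");
    System.out.println("[" + leadingText("1 <2 and 3>2") + "]");
    System.out.println("[" + leadingText("<%FOO%>") + "]");
  }
}
```

A lone '<' inside the text is consumed together with the character after it, while '<<' and '<%' stop the match so that the tag rules can take over. One thing this pattern does not cover is a '<' as the very last character of the input, since there is no following character for the second alternative to match.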

"And, can lexical modes be considered an alternative to semantic predicates?"

Sort of. Semantic predicates are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target-specific code into your grammar, whereas lexical modes work with all targets. So, as a rule of thumb, avoid predicates if possible.



Tags: antlr antlr4