Parsing single line comments

Posted 2019-07-18 05:52

Question:

I am trying to write a grammar for parsing single-line comments. Comments start with '--' and can appear anywhere in the file.

My basic grammar is shown below.

Grammar (aa.g4):

grammar aa;

statement
    :   commentStatement* ifStatement
    |   commentStatement* returnStatement
    ;
ifStatement
    :   'if' '(' expression ')'
        returnStatement+
    ;

returnStatement  :   'return' expression ';' ;
commentStatement :   '--' (.+?) '\\n'? ;
expression       :   IDENTIFIER ;

IDENTIFIER       :   [a-z]([A-Za-z0-9\-\_])* ;
NEWLINE          :   '\r'? '\n'    -> skip ;
WS               :   [ \t\r\f\n]+ -> skip ;

Test class:

public class aaTest {
    static class aaListener extends aaBaseListener {
        public void enterCommentStatement(CommentStatementContext ctx) {
            System.out.println(ctx.getText());
        }
    }

    public static void main(String[] args) throws Exception {
        InputStream is = new FileInputStream("aa.txt");
        CharStream stream = new ANTLRInputStream(is);
        aaLexer lexer = new aaLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        aaParser parser = new aaParser(tokenStream);
        ParseTree aParseTree = parser.statement();
        ParseTreeWalker aWalker = new ParseTreeWalker();
        aWalker.walk(new aaListener(), aParseTree);
    }
}

Input:

--comment1
-- if comment
if (x) --mid if comment
  --end comment
return result;

Output:

--comment1a
--ifcommentif(x)     <<< error output
--midifcomment
--endcomment

Queries:

  1. What causes the erroneous output marked above? I need only "-- if comment" to be printed.
  2. How do I get and output the actual comment text, including its spaces?

Answer 1:

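First, you should define your line comment as a lexer rule that says what you truly mean: '--' followed by everything up to the end of the line. As written, commentStatement is a parser rule, so (.+?) matches whole tokens rather than characters, and the whitespace and newlines have already been skipped by the lexer before the parser runs. That is why the non-greedy operator is not performing the way you intend, and why ctx.getText() contains no spaces.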

LineComment
  : '--' ~[\r\n]* -> channel(HIDDEN)
  ;
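To make the shape of this rule concrete, here is a minimal sketch in plain Java that applies the equivalent regular expression to the sample input from the question. It is an illustration only, not ANTLR code: the class name LineCommentSketch and the method extractComments are made up for this example, and a regex scan ignores the other token rules that a real lexer would also consider.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR or of the asker's project.
class LineCommentSketch {
    // Regex counterpart of the lexer rule  '--' ~[\r\n]*  :
    // two dashes, then any run of characters that are not CR or LF.
    private static final Pattern LINE_COMMENT = Pattern.compile("--[^\\r\\n]*");

    // Returns every line comment in the input, with its text unchanged.
    static List<String> extractComments(String input) {
        List<String> comments = new ArrayList<>();
        Matcher m = LINE_COMMENT.matcher(input);
        while (m.find()) {
            comments.add(m.group());
        }
        return comments;
    }

    public static void main(String[] args) {
        // Sample input from the question.
        String input = "--comment1\n"
                + "-- if comment\n"
                + "if (x) --mid if comment\n"
                + "  --end comment\n"
                + "return result;\n";
        for (String comment : extractComments(input)) {
            System.out.println(comment);
        }
    }
}
```

Each match stops at the end of its own line and keeps its internal spaces, which is the behavior both queries ask for.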

Second, if you want the token stream to contain information about whitespace and newline characters, you should move them to the hidden channel instead of using the skip command. The skip command completely drops the token, making it appear as though the text was never even in the input at all.

NEWLINE
  : '\r'? '\n' -> channel(HIDDEN)
  ;

WS
  : [ \t\f]+ -> channel(HIDDEN)
  ;

Comments will not appear in the parse tree, and you won't use LineComment in any of your parser rules. To get information about these tokens before or after another token in the parse tree, you can examine the tokens around a specific index directly (using TokenStream.get(int)) or with a utility method like BufferedTokenStream.getHiddenTokensToRight or BufferedTokenStream.getHiddenTokensToLeft.
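The idea of channels can be modeled in a few lines of plain Java. The sketch below is a simplified stand-in, not ANTLR's implementation: Tok, parserView, hiddenToLeft and sampleStream are hypothetical names invented here, and the token list is written by hand rather than produced by a lexer. It shows that hidden tokens stay in the stream, that the parser's view filters them out, and that the hidden run in front of a given token can still be recovered.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of token channels; all names are hypothetical.
class HiddenChannelSketch {
    static final int DEFAULT = 0; // tokens the parser sees
    static final int HIDDEN = 1;  // tokens kept in the stream but ignored by the parser

    // Stand-in for a lexer token: its text plus the channel it was sent to.
    static final class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    // What a parser would see: only the default-channel tokens.
    static List<String> parserView(List<Tok> stream) {
        List<String> visible = new ArrayList<>();
        for (Tok t : stream) {
            if (t.channel == DEFAULT) visible.add(t.text);
        }
        return visible;
    }

    // Collects the run of hidden tokens directly before the token at 'index',
    // in source order. Returns an empty list when there is none.
    static List<String> hiddenToLeft(List<Tok> stream, int index) {
        int start = index;
        while (start > 0 && stream.get(start - 1).channel == HIDDEN) start--;
        List<String> hidden = new ArrayList<>();
        for (int i = start; i < index; i++) hidden.add(stream.get(i).text);
        return hidden;
    }

    // Hand-written token stream for:  "-- if comment\nif (x)"
    static List<Tok> sampleStream() {
        List<Tok> s = new ArrayList<>();
        s.add(new Tok("-- if comment", HIDDEN)); // LineComment
        s.add(new Tok("\n", HIDDEN));            // NEWLINE
        s.add(new Tok("if", DEFAULT));
        s.add(new Tok(" ", HIDDEN));             // WS
        s.add(new Tok("(", DEFAULT));
        s.add(new Tok("x", DEFAULT));
        s.add(new Tok(")", DEFAULT));
        return s;
    }

    public static void main(String[] args) {
        List<Tok> stream = sampleStream();
        System.out.println("parser sees: " + parserView(stream));
        // Index 2 is the 'if' token; its first hidden neighbor is the comment.
        System.out.println("comment before 'if': " + hiddenToLeft(stream, 2).get(0));
    }
}
```

With the real runtime, the same lookup would go through the token stream you pass to the parser, using a token's index from the parse tree. Check the return value before using it, since the ANTLR methods may return null when no hidden tokens are found.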



Tags: antlr4