ANTLR4 RegEx lexer modes

2019-08-11 23:21发布

I am working on a Regx parser for RegEx inside XSD. My previous problem was descrived here: ANTLR4 parsing RegEx

I have split the Lexer and Parser since than. Now I have a problem parsing parantheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside. This is my lexer grammar:

lexer grammar RegExLexer;

Char    : ALPHA ;
Int     : DIGIT ;

LBrack  : '[' ;//-> pushMode(modeRange) ;
RBrack  : ']' ;//-> popMode ;
LBrace  : '(' ;
RBrace  : ')' ;
Semi    : ';' ;
Comma   : ',' ;
Asterisk: '*' ;
Plus    : '+' ;
Dot     : '.' ;
Dash    : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe    : '|' ;
Esc     : '\\' ;

WS : [ \t\r\n]+ -> skip ;

fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;

And here is the example:

[0-9a-z()]+

I feel like i should use modes on brackets to change the behaviour of ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice. I have read the reference about this and I still don't get what i should do.

How do I implement the modes?

标签: regex antlr4
2条回答
三岁会撩人
2楼-- · 2019-08-11 23:33

Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:

lexer grammar RegexLexer;

START_CHAR_CLASS
 : '[' -> pushMode(CharClass)
 ;

START_GROUP
 : '('
 ;

END_GROUP
 : ')'
 ;

PLAIN_ATOM
 : ~[()\[\]]
 ;

mode CharClass;

END_CHAR_CLASS
 : ']' -> popMode
 ;

CHAR_CLASS_ATOM
 : ~[\r\n\\\]]
 | '\\' .
 ;

After generating the lexer, you can use the following class to test it:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
        for (Token token : lexer.getAllTokens()) {
            System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
        }
    }
}

And if you run this Main class, the follwoing will be printed to your console:

START_GROUP          (
START_CHAR_CLASS     [
CHAR_CLASS_ATOM      (
CHAR_CLASS_ATOM      )
CHAR_CLASS_ATOM      \]
END_CHAR_CLASS       ]
END_GROUP            )

As you can see, the ( and ) are tokenized differently outside the character class as they are inside of it.

查看更多
冷血范
3楼-- · 2019-08-11 23:38

You're going to have to handle this in the parser, not the lexer. When lexer sees a '(', it will return token LBrace. For lexer, there is no context as to where token is seen. It simply carves up the input into tokens. You will have to define parse rules and when processing parse tree, you can then determine was the LBrace inside brackets or not.

查看更多
登录 后发表回答