ANTLR4 RegEx lexer modes

I am working on a Regx parser for RegEx inside XSD. My previous problem was descrived here: ANTLR4 parsing RegEx

I have split the Lexer and Parser since than. Now I have a problem parsing parantheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside. This is my lexer grammar:

lexer grammar RegExLexer;

Char    : ALPHA ;
Int     : DIGIT ;

LBrack  : '[' ;//-> pushMode(modeRange) ;
RBrack  : ']' ;//-> popMode ;
LBrace  : '(' ;
RBrace  : ')' ;
Semi    : ';' ;
Comma   : ',' ;
Asterisk: '*' ;
Plus    : '+' ;
Dot     : '.' ;
Dash    : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe    : '|' ;
Esc     : '\\' ;

WS : [ \t\r\n]+ -> skip ;

fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;

And here is the example:

[0-9a-z()]+

I feel like i should use modes on brackets to change the behaviour of ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice. I have read the reference about this and I still don't get what i should do.

How do I implement the modes?

标签： regex antlr4

2条回答

三岁会撩人

2楼-- · 2019-08-11 23:33

Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:

lexer grammar RegexLexer;

START_CHAR_CLASS
 : '[' -> pushMode(CharClass)
 ;

START_GROUP
 : '('
 ;

END_GROUP
 : ')'
 ;

PLAIN_ATOM
 : ~[()\[\]]
 ;

mode CharClass;

END_CHAR_CLASS
 : ']' -> popMode
 ;

CHAR_CLASS_ATOM
 : ~[\r\n\\\]]
 | '\\' .
 ;

After generating the lexer, you can use the following class to test it:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
        for (Token token : lexer.getAllTokens()) {
            System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
        }
    }
}

And if you run this Main class, the follwoing will be printed to your console:

START_GROUP          (
START_CHAR_CLASS     [
CHAR_CLASS_ATOM      (
CHAR_CLASS_ATOM      )
CHAR_CLASS_ATOM      \]
END_CHAR_CLASS       ]
END_GROUP            )

As you can see, the ( and ) are tokenized differently outside the character class as they are inside of it.

0人赞添加讨论(0) 举报

冷血范

3楼-- · 2019-08-11 23:38

You're going to have to handle this in the parser, not the lexer. When lexer sees a '(', it will return token LBrace. For lexer, there is no context as to where token is seen. It simply carves up the input into tokens. You will have to define parse rules and when processing parse tree, you can then determine was the LBrace inside brackets or not.

0人赞添加讨论(0) 举报

ANTLR4 RegEx lexer modes

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间