ANTLR4 lexer rules don't work as expected

I want to write a lexer rule that matches a month and a year. As a regular expression, the rule is:

"hello"[0-9]{1,2}"ever"([0-9]{2}([0-9]{2})?)?

The "hello" and "ever" literals are just for debugging.

That is, one or two digits for the month and two or four digits for the year. The year part is optional.

For example:
Aug 2015 -> hello08ever2015, hello8ever2015, hello8ever15, hello8ever, or hello08ever
Oct 2015 -> hello10ever2015, hello10ever15, or hello10ever

My grammar (ANTLR4) is as follows:

grammar Hello;
r  : 'hello' TimeDate 'ever' TimeYear? ;        

TimeDate : Digit Digit?;

TimeYear : TwoDigit TwoDigit?;

TwoDigit : Digit Digit;

Digit : [0-9] ;             

WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

But it does not seem to work. Here are some logs from my testing:

C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello20ever2014
^Z
(r hello 20 ever 2014)

C:\antlr\workspace\demo>grun Hello r -tree -gui

C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello2ever20
^Z
(r hello 2 ever)

C:\antlr\workspace\demo>grun Hello r -tree -gui

C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello20ever14
^Z
(r hello 20 ever)

C:\antlr\workspace\demo>grun Hello r -tree -gui

C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello2ever2014
^Z
(r hello 2 ever 2014)

For the input hello2ever20, the year part '20' is not recognized; for the input hello20ever14, the year part '14' is not recognized.

Could anyone help with this?

Thanks!

Tags: antlr4

1 Answer

Fickle 薄情

#2 · 2019-01-27 05:46

You must realise that ANTLR's lexer runs independently of the parser: it does not "listen" to what the parser might need at a certain position in a parser rule. The lexer tries to match as many characters as possible, and when two (or more) rules match the same number of characters, the rule defined first in the grammar file wins.

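In your case that means that 15 will always be tokenized as a TimeDate and never as a TimeYear, because both rules match 15 but TimeDate is defined first. 2015 will be tokenized as a TimeYear because no other rule matches 4 digits. That is why the optional TimeYear? in r matches nothing for hello2ever20 and hello20ever14: the token after 'ever' is a TimeDate.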
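The longest-match, first-rule-wins behaviour can be illustrated with a small Python simulation. This is only a sketch, not ANTLR itself: the names RULES and tokenize are made up for illustration, the patterns are regex equivalents of the token rules from the question, and placing the 'hello' and 'ever' literals ahead of the named rules is an assumption about how a combined grammar orders its implicit literal tokens.

```python
import re

# Token rules in grammar order (regex equivalents of the question's rules).
# Assumption: the implicit 'hello' and 'ever' literal tokens come first.
RULES = [
    ("'hello'", re.compile(r"hello")),
    ("'ever'", re.compile(r"ever")),
    ("TimeDate", re.compile(r"[0-9][0-9]?")),
    ("TimeYear", re.compile(r"[0-9]{2}(?:[0-9]{2})?")),
    ("TwoDigit", re.compile(r"[0-9]{2}")),
    ("Digit", re.compile(r"[0-9]")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    """Split text into (rule name, matched text) pairs, skipping WS."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in RULES:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for sample in ("hello20ever14", "hello20ever2014"):
    print(sample, tokenize(sample))
```

In this simulation the trailing 14 comes out as a TimeDate token, while the trailing 2014 comes out as a TimeYear token, which mirrors the parse trees in the question.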

A solution would be to change TimeYear into a parser rule (and reference it as timeYear in r):

timeYear
 : TimeDate TimeDate?
 ;
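To see how the token stream looks once TimeYear is no longer a lexer rule, here is a rough Python simulation. It is a sketch under assumptions, not ANTLR output: TOKEN_RULES, tokenize and parse_r are hypothetical names, rule r is assumed to be rewritten as 'hello' TimeDate 'ever' timeYear?, and TwoDigit and Digit are left out because TimeDate is defined before them and always matches at least as many characters.

```python
import re

# Lexer rules after the change: TimeYear is gone from the lexer.
TOKEN_RULES = [
    ("HELLO", re.compile(r"hello")),
    ("EVER", re.compile(r"ever")),
    ("TimeDate", re.compile(r"[0-9][0-9]?")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    """Longest match wins; on a tie the earlier rule is kept. WS is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, regex in TOKEN_RULES:
            m = regex.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m = best
        if name != "WS":
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens


def parse_r(text):
    """Hand-written stand-in for:
       r        : 'hello' TimeDate 'ever' timeYear? ;
       timeYear : TimeDate TimeDate? ;
    Returns (month, year), with year None when it is absent."""
    tokens = tokenize(text)
    if [kind for kind, _ in tokens[:3]] != ["HELLO", "TimeDate", "EVER"]:
        raise ValueError("expected 'hello' TimeDate 'ever'")
    rest = tokens[3:]
    if len(rest) > 2 or any(kind != "TimeDate" for kind, _ in rest):
        raise ValueError("expected at most two TimeDate tokens for the year")
    year = "".join(value for _, value in rest) or None
    return tokens[1][1], year


for sample in ("hello2ever20", "hello20ever14", "hello2ever2014", "hello08ever"):
    print(sample, parse_r(sample))
```

Note that a parser rule of this shape is looser than the original regular expression: it would also accept a one-digit or three-digit year (for example hello8ever5), so a length check on the matched year text may still be needed.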
