ANTLR4 lexer rules don't work as expected

2019-01-27 05:47发布

问题:

I want to write a lexer rule about the month and the year, the rule is(with regular expression):

"hello"[0-9]{1,2}"ever"([0-9]{2}([0-9]{2})?)?

the "hello" and "ever" literals are just for debuging.

that's say, one or two digits for month, and two or four digits for year. And what's more, the year part could be bypass.

such as: Aug 2015 ->hello08ever2015 or hello8ever2015 or hello8ever15 or hello8ever or hello08ever; Oct 2015 -> hello10ever2015 or hello10ever15 or hello10ever;

and my lexer rules are as follow(ANTLR4):

grammar Hello;
r  : 'hello' TimeDate 'ever' TimeYear? ;        

TimeDate : Digit Digit?;

TimeYear : TwoDigit TwoDigit?;

TwoDigit : Digit Digit;

Digit : [0-9] ;             

WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

But it seems not working. Here're some logs for my testing:

C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello20ever2014
^Z
(r hello 20 ever 2014)

C:\antlr\workspace\demo>grun Hello r -tree -gui

C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello2ever20
^Z
(r hello 2 ever)

C:\antlr\workspace\demo>grun Hello r -tree -gui

C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello20ever14
^Z
(r hello 20 ever)

C:\antlr\workspace\demo>grun Hello r -tree -gui

C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello2ever2014
^Z
(r hello 2 ever 2014)

for input: hello2ever20, it can't identify the year part '20'; for input: hello20ever14, it can't identify the year part '14';

Anyone could help on this???

thanks!!

回答1:

You must realise that ANTLR's lexer rules are matched according their position in the grammar file. The lexer does not "listen" what the parser might need at a certain position in a parser rule. The lexer tries to match as much characters as possible, and when 2 (or more) rules match the same amount of characters, the rule defined first will win.

In your case that means that 15 will always be tokenized as a TimeDate and never as a TimeYear because both rules match 15 but TimeDate is defined first. 2015 will be tokenized as a TimeYear because no other rule matches 4 digits.

A solution would be to change TimeYear into a parser rule:

timeYear
 : TimeDate TimeDate?
 ;


标签: antlr4