I want to write a lexer rule for a month and a year. As a regular expression, the rule is:
"hello"[0-9]{1,2}"ever"([0-9]{2}([0-9]{2})?)?
The "hello" and "ever" literals are just for debugging.
That is to say: one or two digits for the month, and two or four digits for the year. In addition, the year part can be omitted.
For example: Aug 2015 -> hello08ever2015 or hello8ever2015 or hello8ever15 or hello8ever or hello08ever; Oct 2015 -> hello10ever2015 or hello10ever15 or hello10ever.
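As a sanity check of the regular expression on its own (outside ANTLR), here is a small Python sketch. The helper name `is_valid` and the `samples` list are only for illustration; the pattern is the one above with the quoted literals written inline.

```python
import re

# The regular expression from above, with the quoted literals written inline.
PATTERN = re.compile(r"hello[0-9]{1,2}ever([0-9]{2}([0-9]{2})?)?")

def is_valid(text):
    """Return True when the whole string matches the month/year pattern."""
    return PATTERN.fullmatch(text) is not None

# The example inputs listed above.
samples = [
    "hello08ever2015", "hello8ever2015", "hello8ever15",
    "hello8ever", "hello08ever",
    "hello10ever2015", "hello10ever15", "hello10ever",
]

for s in samples:
    print(s, is_valid(s))
```

All of the listed examples should be accepted, while a three-digit year such as hello8ever201 should be rejected.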
My lexer rules (ANTLR4) are as follows:
grammar Hello;
r : 'hello' TimeDate 'ever' TimeYear? ;
TimeDate : Digit Digit?;
TimeYear : TwoDigit TwoDigit?;
TwoDigit : Digit Digit;
Digit : [0-9] ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
But it does not seem to work. Here are some logs from my testing:
C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello20ever2014
^Z
(r hello 20 ever 2014)
C:\antlr\workspace\demo>grun Hello r -tree -gui
C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello2ever20
^Z
(r hello 2 ever)
C:\antlr\workspace\demo>grun Hello r -tree -gui
C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello20ever14
^Z
(r hello 20 ever)
C:\antlr\workspace\demo>grun Hello r -tree -gui
C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello2ever2014
^Z
(r hello 2 ever 2014)
For the input hello2ever20, it can't identify the year part '20'; for the input hello20ever14, it can't identify the year part '14'.
Could anyone help with this?
Thanks!
You must realise that ANTLR's lexer rules are matched according to their position in the grammar file. The lexer does not "listen" to what the parser might need at a certain position in a parser rule. The lexer tries to match as many characters as possible, and when 2 (or more) rules match the same number of characters, the rule defined first wins.
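To make the tie-break concrete, here is a rough Python sketch that mimics those two principles (longest match first, then the earliest rule). It is not how ANTLR is implemented: plain regexes stand in for the lexer rules of the grammar in the question, the names HELLO and EVER are made up for the two literals, and whitespace skipping is left out.

```python
import re

# Regex stand-ins for the lexer rules, in the order the grammar defines them.
# HELLO and EVER are hypothetical names for the implicit literal tokens.
RULES = [
    ("HELLO", re.compile(r"hello")),
    ("EVER", re.compile(r"ever")),
    ("TimeDate", re.compile(r"[0-9][0-9]?")),
    ("TimeYear", re.compile(r"[0-9]{2}([0-9]{2})?")),
    ("TwoDigit", re.compile(r"[0-9]{2}")),
    ("Digit", re.compile(r"[0-9]")),
]

def tokenize(text):
    """Split text into (rule name, matched text) pairs, parser-agnostic."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule defined first is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("hello20ever14"))
print(tokenize("hello2ever2014"))
```

In this sketch the trailing 14 comes out as a TimeDate, while 2014 comes out as a TimeYear, which lines up with the parse trees in the question.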
In your case that means that 15 will always be tokenized as a TimeDate and never as a TimeYear, because both rules match 15 but TimeDate is defined first. 2015 will be tokenized as a TimeYear because no other rule matches 4 digits.
A solution would be to change TimeYear into a parser rule: