ANTLR behaviour with conflicting tokens

Posted 2019-07-09 01:59

Question:

How is ANTLR lexer behavior defined in the case of conflicting tokens? Let me explain what I mean by "conflicting" tokens. For example, assume that the following is defined:

INT_STAGE       :   '1'..'6';
INT             :   '0'..'9'+;

There is a conflict here: after reading a sequence of digits, the lexer cannot know whether it has seen one INT, several INT_STAGE tokens, or some combination of both. From a quick test, it looks as if the lexer prefers INT_STAGE when INT is defined after INT_STAGE, but does that mean INT might then not be matched? With the opposite order, no INT_STAGE would ever be found.

Another example would be:

FOOL            :   'fool';
FOO             :   'foo';
ID              :   ('a'..'z'|'A'..'Z'|'_'|'%') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'%')*;

I was told that this is the "right" order to recognize all the tokens: while reading "fool", the lexer will find one FOOL token and not FOO followed by ID, or something else.

Answer 1:

The following logic applies:

  1. the lexer matches as many characters as possible
  2. if, after applying rule 1, two or more rules match the same number of characters, the rule defined first "wins"

Taking this into account, each of the inputs "1", "2", ..., "6" is tokenized as an INT_STAGE: both INT_STAGE and INT match the same number of characters, but INT_STAGE is defined first.

The input "12" is tokenized as an INT, since INT matches the most characters.
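The two rules can be checked outside ANTLR with a small simulation. The sketch below is not ANTLR's implementation; it is a minimal Python model that assumes each lexer rule can be approximated by a greedy regular expression (true for these two rules). The names RULES and tokenize are made up for illustration.

```python
import re

# Hypothetical stand-ins for the lexer rules, listed in grammar order.
RULES = [
    ("INT_STAGE", re.compile(r"[1-6]")),
    ("INT", re.compile(r"[0-9]+")),
]

def tokenize(text, rules):
    """Split text by longest match, breaking ties by rule order."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strictly greater: on a tie, the earlier rule keeps the win.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("5", RULES))   # [('INT_STAGE', '5')]
print(tokenize("12", RULES))  # [('INT', '12')]
print(tokenize("7", RULES))   # [('INT', '7')]
```

Under this model, INT_STAGE is only ever produced for a single digit from 1 to 6 that stands alone; any longer run of digits goes to INT.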

> I was told that this is the "right" order to recognize all the tokens: while reading "fool" the lexer will find one FOOL token and not FOO ID or something else.

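That is correct. Both FOOL and ID match all four characters of "fool" (FOO matches only three), and FOOL is defined first, so FOOL wins.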
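To see why the order of the keyword rules and ID matters, here is a rough Python sketch of the same two rules applied to the first token only. It is an approximation rather than ANTLR itself: the rules are modelled as regular expressions, and first_token, keywords_first and id_first are hypothetical names.

```python
import re

def first_token(text, rules):
    """Return (rule name, matched text) for the first token in text."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces an earlier rule's match.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

ID = r"[a-zA-Z_%][a-zA-Z0-9_%]*"
keywords_first = [("FOOL", r"fool"), ("FOO", r"foo"), ("ID", ID)]
id_first = [("ID", ID), ("FOOL", r"fool"), ("FOO", r"foo")]

print(first_token("fool", keywords_first))   # ('FOOL', 'fool')
print(first_token("fools", keywords_first))  # ('ID', 'fools')
print(first_token("fool", id_first))         # ('ID', 'fool')
```

With ID listed first, the tie on "fool" goes to ID, so in this model the keyword rules would never win; that is why the keywords are placed before ID in the grammar.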