I made this experiment for Flex to see if I enter ABC, if it will see all A, AB, ABC or only ABC or only the first match in the list of expressions.
%{
#include <stdio.h>
%}
%%
A puts("got A");
AB puts("got AB");
ABC puts("got ABC");
%%
int main(int argc, char **argv)
{
yylex();
return 0;
}
When I enter ABC after compiling and running the program, it responds with "Got ABC" which really surprises me since I thought lex doesn't keep track of visited text, and only finds the first match; but actually, it seems to find the longest match.
What strategy does Flex use to respond to A if and only if there is no longer match?
The fact that (F)lex uses the maximal-munch principle should hardly be surprising, since it is well documented in the Flex manual:
When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text…. If it finds two or more matches of the same length, the rule listed first in the flex input file is chosen.
(First paragraph of the section "How the input is matched")
The precise algorithm is exceedingly simple: every time a token is requested, flex scans the text, moving through the DFA. Every time it hits an accepting state, it records the current text position. When no more transitions are possible, it returns to the last recorded accept position, and that becomes the end of the token.
The consequence is that (F)lex can scan the same text multiple times, although it only scans once for each token.
A set of lexical rules which require excessive back-tracking will slow down the lexical scan. This is discussed in the Flex manual section Performance Considerations, along with some strategies to avoid the issue. However, except in pathological cases, the overhead from back-tracking is not noticeable.