I am currently trying to learn how to write my own lexical analyser by hand. I have been using Flex (along with Bison) a lot to practice and to learn how it works internally, and I can now see at least 3 different solutions for developing my own:
- Using a list of REs, going through each one and, when one matches, simply returning the associated token (see the Python docs about REs)
- Creating a DFA from the REs (as Flex does, for example: based on the REs, build one big state machine)
- Creating my own 'state machine' with lots of switch cases or if statements (I think Lua does this, for example; a rough sketch of what I mean is below)
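To make the third option concrete, here is a rough sketch of the kind of hand-written scanner I have in mind (made-up token names and keywords, nothing language-specific):

```python
def next_token(src, pos):
    # Skip whitespace.
    while pos < len(src) and src[pos].isspace():
        pos += 1
    if pos == len(src):
        return ("EOF", "", pos)
    c = src[pos]
    if c.isdigit():                        # number state
        start = pos
        while pos < len(src) and src[pos].isdigit():
            pos += 1
        return ("NUMBER", src[start:pos], pos)
    if c.isalpha() or c == "_":            # identifier/keyword state
        start = pos
        while pos < len(src) and (src[pos].isalnum() or src[pos] == "_"):
            pos += 1
        word = src[start:pos]
        return ("KEYWORD" if word in ("if", "while", "return") else "ID", word, pos)
    return ("PUNCT", c, pos + 1)           # single-character tokens
```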
I am confident that I can try each solution, but:
- Is there a case that one of these solutions can't handle?
- In what cases would you use one solution rather than another?
- And, as the title says: which one produces the most efficient code?
Thanks in advance!
The second and third alternatives will be equivalent if and only if you manage to write your state machine and your flex lexical description without any bugs. YMMV but my experience is that it's a lot easier to write (and read) the flex lexical description.
The first alternative is probably not equivalent, and it will not be trivial to make it equivalent in the general case.
The issue is what happens when more than one pattern matches the input at the current position. (This issue also leads to subtle bugs when writing the massive switch statements of your third alternative.) The generally accepted lexical strategy is to use the "maximal munch" rule in this case: choose the pattern which results in the longest match, and if there is more than one such pattern, choose the one which appears first in the lexical definition.
As a simple example of why this rule is important, consider a language which has the keywords `do` and `double`. Observe that the desirable behaviour is that the input `do` is scanned as the keyword `do`, `double` is scanned as the keyword `double`, and a longer word such as `doubles` is scanned as an identifier, even though it begins with `do`. In a standard (f)lex file this would be implemented as:
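(A sketch with placeholder token names; the identifier rule is just illustrative.)

```
"do"                       { return T_DO; }
"double"                   { return T_DOUBLE; }
[[:alpha:]_][[:alnum:]_]*  { return T_IDENTIFIER; }
```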
(F)lex will produce exactly the same scanner if the first two rules happen to be in a different order, although the third rule definitely must be at the end. But being able to reorder the first two rules makes the description much less error-prone. Certainly, some people will write their lexical rules with the keywords in alphabetic order, as above, but others might choose to organize the keywords by syntactic function, so that `do` is lumped in with `for`, `while`, `done`, etc., and `double` with `int`, `char`, etc. With the latter organization, it will be difficult for the programmer to ensure that overlapping keywords appear in any particular order, so it is useful that flex doesn't care; in this case (as in many other cases) choosing the longest match is certainly correct.

If you create a list of regular expressions and just choose the first match, you will need to ensure that the regular expressions are in reverse order by match length, so that the one which matches the longest keyword comes first. (This puts `double` before `do`, so alphabetically ordered keywords will fail.)

Worse, it may not be immediately obvious which regular expression has the longest match. It's clear for keywords -- you can just reverse sort the literal patterns by length -- but in the general case, the maximal munch rule might not induce a consistent ordering over the regular expressions: it might be the case that for one token, one regular expression has the longest match, while another regular expression provides the longer match for a different token. (For example, with the patterns `a+` and `ab*`, the first should win on the input aaa but the second should win on abb, so no fixed order between them is always right.)
As an alternative, you could try all the regular expressions and keep track of the one which had the longest match. That will correctly implement maximal munch (but see below), but it's even more inefficient because every pattern must be tried at every token.
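For concreteness, here is a minimal sketch of that approach in Python (hypothetical token names; error handling kept to a minimum):

```python
import re

# Hypothetical token patterns, one regex per token kind.
PATTERNS = [
    ("DOUBLE", re.compile(r"double")),
    ("DO",     re.compile(r"do")),
    ("ID",     re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("SKIP",   re.compile(r"\s+")),
]

def tokenize(src):
    pos = 0
    while pos < len(src):
        best = None
        # Try every pattern and keep the longest match (maximal munch);
        # ties are broken by the order of PATTERNS.
        for name, pat in PATTERNS:
            m = pat.match(src, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise SyntaxError(f"unexpected character at {pos}: {src[pos]!r}")
        name, m = best
        if name != "SKIP":
            yield (name, m.group())
        pos = m.end()

print(list(tokenize("do doubles double")))
# -> [('DO', 'do'), ('ID', 'doubles'), ('DOUBLE', 'double')]
```

Note that it really does run every pattern at every position, which is exactly the inefficiency mentioned above.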
The actual code in the Python documentation you link to creates a single regular expression from the provided patterns, by interpolating `|` operators between the various regexes. (This makes it impossible to use numbered captures, but that might not be an issue.)

If Python regular expressions had Posix longest-match semantics, this would be equivalent to maximal munch, but they don't: a Python alternation will prefer the first alternative that matches, unless a later alternative is required for the rest of the regular expression to succeed.
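For example (a made-up illustration, not the snippet from the Python documentation):

```python
import re

# Alternation prefers the first alternative that lets the match succeed,
# not the longest one: "do" wins even though "double" also matches here.
print(re.match(r"do|double", "double").group())      # -> 'do'

# A later alternative is only chosen when it is needed for the rest of
# the pattern to succeed:
print(re.match(r"(do|double)\b", "double").group(1)) # -> 'double'
```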
To get this right, you'll have to take a bit of care to ensure that your regular expressions are correctly ordered and don't interfere with each other's matches. (Not all regex libraries work the same way as Python, but many do. You'll need to check the documentation, and perhaps do some experimenting.)
In short, for an individual language, if you're prepared to put some work into it, you will be able to hand-build a lexer which works "correctly" (assuming the language insists on maximal munch, as most standardised languages do), but it's definitely a chore. And not just for you: it will also be additional work for anyone who wants to understand or validate your code.
So in terms of efficiency of writing code (including debugging), I'd say that lexer generators like (f)lex are a clear winner.
There's a long-standing meme that hand-built (or open-coded) lexical scanners are faster. If you want to experiment with that, you could try using `re2c`, which produces highly-optimised open-coded lexical scanners. (By open-coded, I mean that they don't use transition tables.) That theory might or might not be true for a given set of lexical rules, because the table-based lexers (as produced by (f)lex) are generally much smaller in code size, and therefore make more efficient use of processor caches. If you choose flex's fast (but larger) table options, then the inner loop of the scanner is very short and contains only one conditional branch. (But branch prediction on that single branch is not going to be highly effective.) By contrast, the open-coded scanners have a large amount of code in the loop with a lot of conditional branches, most of which are reasonably easy to predict. (It's not that the execution path is longer; rather that the inner loop is not short enough to stay in the cache.)

Either way, I think it's reasonable to say that the difference is not going to break the bank, and my advice is always to go with the lexer which is easier for other people to read, particularly if you ever plan on asking for help with it on SO :-)