I've heard that the "|" operator slows down regex matching, and it certainly seems to be true in Perl, for example.
Do I have to worry about that when I build scanners with tools like the Flex lexer generator?
I've heard that the "|" operator slows down regex matching, and it certainly seems to be true in Perl, for example.
Do I have to worry about that when I build scanners with tools like the Flex lexer generator?
Absolutely not. Flex (like the lex lexer generator on which it was based, and most other similar compiler-construction tools) compiles all of the regular expressions in the scanner into a single Deterministic Finite State Automaton (DFA). The DFA never backs up during the scan of a lexical token, and neither the complexity of the component regular expressions nor the precise operators they use make the slightest difference.
This is quite different from Perl's approach to regex matching. Perl (at least under some circumstances) will explore the possible subpatterns in an alternation expression one at a time, with the consequence that a significant performance hit can be noticed. This effect was demonstrated by a Perl benchmark constructed by @sln in this answer.
To show that this is not true with Flex-generated parsers, I constructed a very similar benchmark in Flex. The two Flex input files:
(With |):
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
static int echo_match = 0;
static const char* regex="'''([^']|['][^']|[']['][^'])*'''";
%}
%option noyywrap noinput nounput nodefault
%%
[^'\n]+ { }
\n { }
'[^'\n]*' {
if (echo_match) printf("%s\n", yytext);
}
'''([^']|['][^']|[']['][^'])*''' {
if (echo_match) printf("%s\n", yytext);
}
' { fputs("Unmatched quote\n", stderr); }
Without | is identical, except for the regex pattern (in two places):
'''[^']*(?:[']{1,2}[^']+)*'''
I then used this main
to test each of the scanners:
static const char* dataset[] = {
"'''Set 1 - this\nis another\nmultiline\nstring'''",
"'''Set 2 - this\nis' another\nmul'tiline\nst''ring'''"
};
#define COUNTOF(ARY) (sizeof(ARY)/sizeof(ARY[0]))
int main(int argc, char** argv) {
int reps = 500000;
printf("-----------------------\nUsing regex: %s\n", regex);
for (unsigned d = 0; d < COUNTOF(dataset); ++d) {
echo_match = 1;
yy_scan_string(dataset[d]);
yylex();
yy_delete_buffer(YY_CURRENT_BUFFER);
echo_match = 0;
struct timespec before, after;
if (clock_gettime(CLOCK_REALTIME, &before)) perror("gettime");
for (int r = 0; r < reps; ++r) {
yy_scan_string(dataset[d]);
yylex();
yy_delete_buffer(YY_CURRENT_BUFFER);
}
if (clock_gettime(CLOCK_REALTIME, &after)) perror("gettime");
unsigned long elapsed =
(after.tv_sec - before.tv_sec) * 1000000
+ (after.tv_nsec - before.tv_nsec) / 1000;
printf("Wall-clock: %ld microseconds\n", elapsed);
}
return 0;
}
Random results of the two sets of benchmarks (from a set of 10 runs of each):
$ tail -n+$((1+12*(RAND/10))) threeq.log | head -n12
-----------------------
Using regex: '''([^']|['][^']|[']['][^'])*'''
'''Set 1 - this
is another
multiline
string'''
Wall-clock: 243410 microseconds
'''Set 2 - this
is' another
mul'tiline
st''ring'''
Wall-clock: 233519 microseconds
$ tail -n+$((1+12*(RAND/10))) threeq2.log | head -n12
-----------------------
Using regex: '''[^']*(?:[']{1,2}[^']+)*'''
'''Set 1 - this
is another
multiline
string'''
Wall-clock: 246842 microseconds
'''Set 2 - this
is' another
mul'tiline
st''ring'''
Wall-clock: 242191 microseconds
There are certain cases where Flex needs to rescan input after a token has been recognized, but these have nothing to do with the alternation operator |. Because Flex always tries to find the longest match, it might need to continue scanning even after a token has been recognized, in case that token were a prefix of another possible token. If the longer token proves to not be matchable, then the Flex scanner will back up to the end of the longest token matched, and the remaining characters will be rescanned in the next token.
As an example, in C, .
and ...
are both valid tokens, but ..
is not. If ..
appears in the input, then a Flex-built scanner would have to continue scanning after the first . in order to see if it could match ...
. However, when the third character proves not to be a ., it will have to return the .
token, and then the second . will be rescanned. (That will almost always be a syntax error, so the issue is not very serious.)
Another case where Flex-built scanners will need to rescan is when the / (trailing context) operator is used. Since the trailing context is not actually part of the returned token, it must be rescanned as part of the next token.
Neither of these situations are common, and the rescanned sequences are generally very short (typically one character), so the performance impact is trivial. But if you really worry about this, you can provide the --backup
option to Flex to ask it to prepare a report of backup states. If you manage to eliminate all the backup states, you will get a small performance boost, but most of the time this qualifies as premature optimization.