I've got a simple pattern to match: head+content+tail. My lex file looks like this:
$ cat b.l
%{
#include<stdio.h>
%}
%%
"12" {printf("head\n");}
"34" {printf("tail\n");}
.* {printf("content\n");}
%%
I hope that on meeting "12" it will print "head", on meeting "34" it will print "tail", and for any other contiguous string it will print "content".
So I compiled and ran it:
$ lex b.l && gcc lex.yy.c -ll
$ echo '12sdaesre34'|a.out
content
My expectation is, it will print
head
content
tail
But it actually prints only the "content" line. Did I get something wrong, and how do I correct it?
Thanks!
(F)lex always matches the longest possible token. Since .* will match any sequence which doesn't contain a newline character, it will happily match 12sdaesre34. (In (f)lex, . matches any character other than newline.) Thus the 34 is no longer available to be matched.
To fix it, you have to be clear about what you want content to match. For example, you could match anything which doesn't contain a digit. You might also want to add newline to the list of characters not to match.
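As a rough sketch of that idea (not necessarily the exact rules originally posted), the question's scanner could be changed as below. The %option noyywrap line and the main function are assumptions added so that the file builds with flex without -ll, and the \n rule is an addition so that the trailing newline from echo is not reported.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
"12"        { printf("head\n"); }
"34"        { printf("tail\n"); }
[^0-9\n]+   { /* a run of characters that are neither digits nor newline;
                 use [^0-9]+ instead if newlines should count as content */
              printf("content\n"); }
\n          { /* ignore line ends */ }
%%
int main(void)
{
    yylex();
    return 0;
}
```

With this version, echo '12sdaesre34' piped into the scanner should print head, content and tail on separate lines. Digits other than 12 and 34 match no rule here, so they fall through to the default rule and are echoed unchanged.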
Or perhaps you want to match the longest sequence not containing 34. That's trickier, but it can be done. However, such a pattern will still match an initial 12, so it won't be enough to solve the problem.
If your input always consists of strings of the form 12...34 possibly interspersed with other content, you can match the entire 12...34 sequence and split it into three tokens. That's undoubtedly the simplest solution, since the beginning and end markers are of a known length. It takes three patterns. The first matches a string which doesn't start with 12, ending just before the first instance of 12, and the second matches a string starting with 12 and ending at the first instance of 34 (which is matched). Neither of these patterns will match an input which contains an unmatched 12, so a third rule is added to match that case; it looks a lot like the second rule but doesn't include the match for 34 at the end. Because (f)lex always matches the longest possible token, the third rule will only succeed if the second rule fails.
Normally, you would want to actually capture the value of
content to pass on to the calling program. In the first rule, that is just yytext, but in the second rule the content consists of yyleng-4 characters starting at yytext+2 (in order to remove the leading and trailing delimiters).
For most purposes, it is necessary to copy the matched token if you need to keep it, because yytext points into an internal data structure used by the lexical scanner and the pointer will be invalidated by the next pattern match. In the case of the first rule, you could create a copy of the string using strdup (or malloc followed by strcpy), but for the second rule, you'd want to make the copy yourself, since only the middle of yytext is wanted.
Such rules assume that yylval is a global variable of type char*, and that somewhere in the code you free() the string saved by the rule. They also assume that you do something with yylval in the omitted code (...), or that you return to the caller with an indication as to whether the head and tail were encountered.
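Putting the three-rule approach and the copying together, a minimal sketch might look like the following. It is one possible reading of the description above, not a reconstruction of the original rules. The helper copy_span is a hypothetical name, used here for both copies in place of strdup. The lone 1 rule and the \n rule are additions to cover a 1 that does not begin 12 and to skip line ends, and %option noyywrap plus main are assumed so that the file builds with flex without -ll. The printf calls stand in for whatever the real actions would do with yylval.

```lex
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *yylval;   /* assumed global that receives the copied content */

/* Hypothetical helper: heap copy of n bytes starting at s, NUL-terminated. */
static char *copy_span(const char *s, size_t n)
{
    char *p = malloc(n + 1);
    if (p) {
        memcpy(p, s, n);
        p[n] = '\0';
    }
    return p;
}
%}
%option noyywrap
%%
\n                      { /* skip line ends */ }
([^1\n]|1+[^12\n])+     { /* content outside any 12...34 pair */
                          yylval = copy_span(yytext, (size_t)yyleng);
                          if (yylval) printf("content: %s\n", yylval);
                          free(yylval);
                        }
1                       { /* a 1 that is not the start of 12 */
                          printf("content: 1\n");
                        }
12([^3]|3+[^34])*3+4    { /* 12, content, first 34: drop 2 bytes at each end */
                          yylval = copy_span(yytext + 2, (size_t)yyleng - 4);
                          if (yylval)
                              printf("head\ncontent: %s\ntail\n", yylval);
                          free(yylval);
                        }
12([^3]|3+[^34])*3*     { /* 12 with no closing 34 anywhere after it */
                          fprintf(stderr, "unterminated 12\n");
                        }
%%
int main(void)
{
    yylex();
    return 0;
}
```

Fed the input from the question, this should print head, then content: sdaesre, then tail. Note that in this sketch the content between 12 and 34 may span newlines, and a run of 1s directly before a 12 is reported as separate one-character content tokens rather than merged into the preceding content.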