My lex pattern doesn't work to match my input

Posted 2019-09-08 12:26

Question:

I've got a simple pattern to match: head + content + tail. My lex file looks like this:

$ cat b.l
%{
#include<stdio.h>
%}
%%
"12" {printf("head\n");}
"34" {printf("tail\n");}
.* {printf("content\n");}
%%

I expect it to print "head" when it meets "12", "tail" when it meets "34", and "content" for any other contiguous string.

So I compile and run it:

$ lex b.l && gcc lex.yy.c -ll
$ echo '12sdaesre34' | ./a.out
content

I expected it to print:

head
content
tail

But it actually prints only the "content" line. What did I get wrong, and how can I correct it?

Thanks!

Answer 1:

(F)lex always matches the longest possible token. Since .* will match any sequence which doesn't contain a newline character, it will happily match 12sdaesre34. (In (f)lex, . matches any character other than newline.) Thus the 34 is no longer available to be matched.
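The longest-match rule can be checked outside of lex. The following is a minimal sketch, not what (f)lex generates: it uses the POSIX <regex.h> functions as a stand-in matcher, and the helper names match_length_at_start and winning_rule are hypothetical. It tries each of the three patterns from the question at the start of the input and picks the longest match, with ties going to the earlier rule, which is how (f)lex chooses between rules.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: length of the match of `pattern` at the very
   start of `input`, or -1 if it does not match there (or does not compile). */
long match_length_at_start(const char *pattern, const char *input)
{
    regex_t re;
    regmatch_t m;
    long len = -1;

    assert(pattern != NULL && input != NULL);
    /* REG_NEWLINE keeps `.` from matching a newline, as in (f)lex. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NEWLINE) != 0)
        return -1;
    /* regexec reports the leftmost match; accept it only if it starts at 0. */
    if (regexec(&re, input, 1, &m, 0) == 0 && m.rm_so == 0)
        len = (long)m.rm_eo;
    regfree(&re);
    return len;
}

/* Hypothetical helper: index of the rule with the longest non-empty match
   at the start of `input`. Ties go to the earlier rule; -1 if none match. */
int winning_rule(const char *const patterns[], int count, const char *input)
{
    int best = -1;
    long best_len = 0;

    assert(patterns != NULL && input != NULL);
    for (int i = 0; i < count; i++) {
        long len = match_length_at_start(patterns[i], input);
        if (len > best_len) {   /* strictly longer, so earlier rules win ties */
            best = i;
            best_len = len;
        }
    }
    return best;
}
```

With the rules { "12", "34", ".*" } and the input 12sdaesre34, "12" matches 2 characters but ".*" matches all 11, so the third rule wins and the head and tail rules never get a chance to fire.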

To fix it, you have to be clear about what you want content to match. For example, the following will match anything which doesn't contain a digit:

[^[:digit:]]+   { printf("content\n"); }

You might want to add newline to the list of characters to not match:

[^\n[:digit:]]+   { printf("content\n"); }

Or perhaps you want to match the longest sequence not containing 34. That's trickier but it can be done:

([^3]|3+[^34])+   { printf("content\n"); }

However, that will still match the initial 12, so it isn't enough to solve the problem on its own.

If your input always consists of strings of the form 12...34, possibly interspersed with other content, you can match the entire 12...34 sequence and split it into three tokens. That's undoubtedly the simplest solution, since the beginning and end markers are of a known length. The first of the following patterns matches a string which doesn't start with 12 and ends just before the first instance of 12. The second matches a string which starts with 12 and ends at the first instance of 34 (which is included in the match). Neither pattern will match an input which contains an unmatched 12, so a third rule is added for that case; it looks a lot like the second rule but doesn't include the match for 34 at the end. Because (f)lex always matches the longest possible token, the third rule will only succeed if the second rule fails.

([^1]|1+[^12])*         { puts("content"); }
12([^3]|3+[^34])*3+4    { puts("head content tail"); }
12([^3]|3+[^34])*       { puts("error"); }
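To make the intent of the three rules concrete, here is a hand-written approximation in plain C. It is only a sketch: it uses strstr instead of the regular expressions, the names token_kind and next_token are hypothetical, and it does not reproduce every edge case of the patterns (for example, how a stray 1 immediately before 12 is handled).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_END, TOK_CONTENT, TOK_DELIMITED, TOK_ERROR };

/* Hypothetical helper: classify the token that starts at `s` and store
   its length in *len. Mirrors the intent of the three rules:
     - text up to the next "12"            -> TOK_CONTENT
     - "12" ... up to and including "34"   -> TOK_DELIMITED
     - "12" with no "34" after it          -> TOK_ERROR            */
enum token_kind next_token(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);

    if (*s == '\0') {
        *len = 0;
        return TOK_END;
    }

    if (strncmp(s, "12", 2) == 0) {
        /* Look for the first "34" after the head. */
        const char *tail = strstr(s + 2, "34");
        if (tail != NULL) {
            *len = (size_t)(tail - s) + 2;   /* include the tail itself */
            return TOK_DELIMITED;
        }
        *len = strlen(s);                    /* unmatched head: rest of input */
        return TOK_ERROR;
    }

    /* Not at a head: content runs up to the next "12" or the end of input. */
    const char *head = strstr(s, "12");
    *len = (head != NULL) ? (size_t)(head - s) : strlen(s);
    return TOK_CONTENT;
}
```

Calling next_token repeatedly on ab12cd34ef, advancing by *len each time, yields a content token (ab), a delimited token (12cd34), and another content token (ef).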

Normally, you would want to actually capture the value of content to pass on to the calling program. In the first rule, that is just yytext, but in the second rule the content consists of yyleng-4 characters starting at yytext+2 (in order to remove the leading and trailing delimiters).

For most purposes, it is necessary to copy the matched token if you need to keep it, because yytext points into an internal data structure used by the lexical scanner and the pointer will be invalidated by the next pattern match. In the case of the first rule, you could create a copy of the string using strdup, but for the second rule, you'd want to make the copy yourself:

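([^1]|1+[^12])*         { yylval = strdup(yytext); ... }
12([^3]|3+[^34])*3+4    { /* yyleng-4 content bytes plus one for the terminator */
                          yylval = malloc(yyleng-3);
                          /* skip the two-character head; stop before the tail */
                          memcpy(yylval, yytext+2, yyleng-4);
                          yylval[yyleng-4] = '\0';
                          ...
                        }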

These snippets assume that yylval is a global variable of type char*, that <stdlib.h> and <string.h> are included in the definitions section, and that somewhere in the code you free() the string saved by the rule. They also assume that you do something with yylval in the omitted code (...), or that you return to the caller with an indication as to whether the head and tail were encountered.
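The offset arithmetic (yyleng-4 characters starting at yytext+2) is easy to get wrong, so it may help to test it in isolation. This is a minimal sketch with a hypothetical helper, copy_content, which takes the matched text and its length as ordinary arguments in place of yytext and yyleng. It assumes both delimiters are exactly two characters long, as in the rules above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd copy of the text between a
   two-character head and a two-character tail. `text` and `len` play
   the roles of yytext and yyleng. The caller must free() the result.
   Returns NULL if the allocation fails. */
char *copy_content(const char *text, size_t len)
{
    assert(text != NULL && len >= 4);   /* at least head + tail */

    size_t content_len = len - 4;       /* drop 2 leading and 2 trailing chars */
    char *copy = malloc(content_len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, text + 2, content_len);  /* start just after the head */
    copy[content_len] = '\0';
    return copy;
}
```

For the token 12sdaesre34 (length 11) this returns the 7-character string sdaesre; for the token 1234 it returns an empty string.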