My flex file is given below. Beyond trivial symbols, it defines a state machine to read strings: it starts whenever it encounters a " and terminates on locating the following ". Now when I feed this flex file an input with two strings, one after the other, like this:
"this" "apple"
It correctly identifies this but fails to find apple. Why is this happening? I have put in a BEGIN(INITIAL) statement, but it does not work.
/* sample simple scanner
*/
%{
int num_lines = 0;
#define CLASS 10
#define LAMBDA 1
#define DOT 2
#define PLUS 3
#define OPEN 4
#define CLOSE 5
#define NUM 6
#define ID 7
#define INVALID 8
#define MAX_STR_CONST 256
#define COMMENT 11
char string_buf[256];
char *string_buf_ptr;
char string_buf_cmnt[256];
char *string_buf_ptr_cmnt;
int size = 0;
%}
%x str
%x comment1
%x comment2
%%
\" {
string_buf_ptr = (char*)malloc(8); size = 0; BEGIN(str);}
<str>\" { /* saw closing quote - all done */
/* return string constant token type and
* value to parser
*/
*string_buf_ptr = '\0'; /* terminate the string with a NUL */
string_buf_ptr = string_buf_ptr - size; /* scale back string ptr to start */
int i = 0;
for (; i < size; i++){
yytext[i]=*(string_buf_ptr + i); /* copy each character to yytext */
}
yytext[i]='\0'; /* terminate the string with a NUL */
free(string_buf_ptr);
BEGIN(INITIAL); /* go back to start */
return ID;
}
<str>\n {
/* error - unterminated string constant */
/* generate error message */
//printf("error is here\n");
}
<str>\\0 ;
<str>\\[0-7]{1,3} {
/* octal escape sequence */
int result;
(void) sscanf( yytext + 1, "%o", &result );
if (result == 0x00){
*string_buf_ptr++ = '0';
} else {
if ( result > 0xff ){
/* error, constant is out-of-bounds */}
else{*string_buf_ptr++ = result;}
}
size++;
}
<str>\\[0-9]+ {
/* generate error - bad escape sequence; something
* like '\48' or '\0777777'
*/
}
<str>\\n *string_buf_ptr++ = '\n'; size++;
<str>\\t *string_buf_ptr++ = '\t'; size++;
<str>\\r *string_buf_ptr++ = '\r'; size++;
<str>\\b *string_buf_ptr++ = '\b'; size++;
<str>\\f *string_buf_ptr++ = '\f'; size++;
<str>\\a *string_buf_ptr++ = '\a'; size++;
<str>\\(.|\n) *string_buf_ptr++ = yytext[1]; size++;
<str>[^\\\n\"]+ {
//printf("there\n");
char *yptr = yytext;
int i = 0;
while ( *yptr )
{
*string_buf_ptr++ = *yptr++;
yytext[i] = *(string_buf_ptr-1);
size++;
i++;
}
}
[ ]+ //printf("space\n");
%%
int main(int argc, char **argv) {
int res;
yyin = stdin;
while(res = yylex()) {
printf("class: %d lexeme: %s line: %d\n", res, yytext, num_lines);
}
}
You can't overwrite yytext like that. yytext is not guaranteed to point at usable memory beyond the current token, and in any case you're not allowed to modify yytext outside of the current token.

So what's happening is that you end up copying this over the top of the pending input, which overwrites the " that starts the second string. As a result, it's not going to be recognized as a string.

Instead of overwriting yytext, just make your string_buf_ptr visible to the caller of yylex, either by making it a global variable or by passing a pointer to a return value as an extra argument to the lexer (see the YY_DECL macro). Of course, that will force you to change your memory management strategy, but your current memory management won't work either, since some tokens are likely to be more than seven characters long.

Personally, I'd avoid the global and keep a static char* which can be passed back to the caller via an out parameter. Then you can require that the caller make a copy of the string if they need to keep it beyond the next call to yylex. You could insist that the caller free the string, but the advantage of the "caller copies" strategy is that no copy will be made if the caller doesn't need to persist the string. This is precisely the strategy used with yytext; yytext will be destroyed by the next call to yylex, so a caller needing to persist the token's value needs to make a copy of yytext.
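
To make the "scanner owns the buffer, caller copies" idea concrete, here is a minimal sketch in plain C rather than flex, so that it can be compiled on its own. Everything in it is hypothetical and not part of the original scanner: the names next_string, tok_append and copy_token, the token codes, and the hand-written loop that stands in for the <str> rules. In a real flex file the same out parameter would presumably be declared by redefining YY_DECL, for example as int yylex(const char **value), and the appends would happen inside the <str> actions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes for this sketch only. */
enum { TOK_EOF = 0, TOK_STRING = 7, TOK_ERROR = 8 };

/* Scanner-owned buffer. It is reused on every call, so its contents
 * are only valid until the next call to next_string(). */
static char  *tok_buf = NULL;
static size_t tok_len = 0;
static size_t tok_cap = 0;

/* Append one character, growing the buffer as needed (this replaces
 * the fixed malloc(8)). Returns 0 on success, -1 if allocation fails. */
static int tok_append(char c)
{
    if (tok_len + 1 >= tok_cap) {            /* keep room for the NUL */
        size_t new_cap = tok_cap ? tok_cap * 2 : 16;
        char *p = realloc(tok_buf, new_cap);
        if (p == NULL)
            return -1;
        tok_buf = p;
        tok_cap = new_cap;
    }
    tok_buf[tok_len++] = c;
    tok_buf[tok_len] = '\0';
    return 0;
}

/* Read the next double-quoted string starting at *cursor.
 * On TOK_STRING, *value points at the scanner-owned buffer.
 * The input itself is never modified. */
int next_string(const char **cursor, const char **value)
{
    const char *p = *cursor;
    *value = NULL;

    while (*p == ' ')                        /* skip blanks between tokens */
        p++;
    if (*p == '\0') { *cursor = p; return TOK_EOF; }
    if (*p != '"')  { *cursor = p + 1; return TOK_ERROR; }
    p++;                                     /* skip the opening quote */

    tok_len = 0;
    if (tok_buf != NULL)
        tok_buf[0] = '\0';

    while (*p != '"') {
        char c = *p;
        if (c == '\0' || c == '\n') {        /* unterminated string */
            *cursor = p;
            return TOK_ERROR;
        }
        if (c == '\\') {                     /* simple escapes only */
            c = *++p;
            if (c == '\0') { *cursor = p; return TOK_ERROR; }
            if (c == 'n') c = '\n';
            else if (c == 't') c = '\t';
        }
        if (tok_append(c) != 0) { *cursor = p; return TOK_ERROR; }
        p++;
    }

    *cursor = p + 1;                         /* skip the closing quote */
    *value = (tok_buf != NULL) ? tok_buf : "";
    return TOK_STRING;
}

/* What a caller does when it needs the value beyond the next call. */
char *copy_token(const char *value)
{
    size_t n = strlen(value) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, value, n);
    return copy;
}
```

A caller would point a const char * cursor at the input, call next_string(&cursor, &value) in a loop until it returns TOK_EOF, and call copy_token(value) only for the tokens it wants to keep; the caller then owns and frees those copies. Because the scanner writes only into its own buffer, the opening quote of the second string in "this" "apple" is left untouched.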