Flex and terminating state machine for reading str

Posted 2019-09-06 13:53

Question:

My flex file is given below. Besides a few trivial symbols, it uses a start condition to read strings: the scanner enters the string state whenever it encounters a " and leaves it at the next ". When I feed the scanner an input with two strings one after the other, like this:

"this" "apple"

It correctly identifies this but fails to find apple. Why does this happen? I have put BEGIN(INITIAL) in the closing-quote action, but it does not help.

/* sample simple scanner 
*/
%{    
int num_lines = 0;
#define CLASS 10
#define LAMBDA 1
#define DOT    2
#define PLUS   3
#define OPEN   4
#define CLOSE  5
#define NUM    6
#define ID     7
#define INVALID 8
#define MAX_STR_CONST 256
#define COMMENT 11

char string_buf[256];
char *string_buf_ptr;

char string_buf_cmnt[256];
char *string_buf_ptr_cmnt;
 int size = 0;
%}     
%x str
%x comment1
%x comment2
%%


\"     {
  string_buf_ptr = (char*)malloc(8); size = 0; BEGIN(str);}
<str>\"        {           /* saw closing quote - all done */
  /* return string constant token type and
   * value to parser
   */

  *string_buf_ptr = '\0';  /* terminate the string with a NUL */

  string_buf_ptr = string_buf_ptr - size; /* scale back string ptr to start */

  int i = 0;

  for (; i < size; i++){
    yytext[i]=*(string_buf_ptr + i); /* copy each character to yytext */
  }

  yytext[i]='\0';             /* terminate the copy with a NUL */
  free(string_buf_ptr);

  BEGIN(INITIAL);            /* go back to start */
  return ID;
 }
<str>\n        {
  /* error - unterminated string constant */
  /* generate error message */
  //printf("error is here\n"); 
 }
<str>\\0        ;
<str>\\[0-7]{1,3} {
  /* octal escape sequence */
  int result;
  (void) sscanf( yytext + 1, "%o", &result );
  if (result == 0x00){
     *string_buf_ptr++ = '0';
  } else {
    if ( result > 0xff ){
      /* error, constant is out-of-bounds */}
    else{*string_buf_ptr++ = result;}
  }
       size++;
 }
<str>\\[0-9]+ {
  /* generate error - bad escape sequence; something
   * like '\48' or '\0777777'
   */
 }
<str>\\n  *string_buf_ptr++ = '\n';  size++; 
<str>\\t  *string_buf_ptr++ = '\t';  size++;
<str>\\r  *string_buf_ptr++ = '\r';  size++;
<str>\\b  *string_buf_ptr++ = '\b';  size++;
<str>\\f  *string_buf_ptr++ = '\f';  size++;
<str>\\a  *string_buf_ptr++ = '\a';  size++;

<str>\\(.|\n)  *string_buf_ptr++ = yytext[1];  size++;  

<str>[^\\\n\"]+        {
  //printf("there\n");
  char *yptr = yytext;
  int i = 0;
  while ( *yptr )
    {
      *string_buf_ptr++ = *yptr++;
      yytext[i] = *(string_buf_ptr-1);
      size++;
      i++;
    }
}
[ ]+     //printf("space\n");
%%


int main(int argc, char **argv) {
  int res;
  yyin = stdin;

  while(res = yylex()) {  
    printf("class: %d lexeme: %s line: %d\n", res, yytext, num_lines); 
  }
} 

Answer 1:

You can't overwrite yytext like that. yytext is not guaranteed to point at usable memory beyond the current token, and anyway you're not allowed to modify yytext outside of the current token.

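What happens is that the closing-quote action copies the text this over the top of the pending input. When that action runs, yytext points at the closing " of the first string, so the copy runs past it and overwrites the " that starts the second string. The second string is therefore never recognized as a string.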
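The overwrite can be reproduced outside flex with a plain character array. The sketch below is only a simulation: simulate_overwrite and has_quote_between are hypothetical helper names, the array stands in for flex's input buffer, and the details of flex's real buffer handling are ignored. It assumes that yytext points at the closing quote when the closing-quote action runs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simulation only: `input` stands in for flex's input buffer and
 * `token_start` for the offset yytext has when the closing-quote rule
 * fires. Copies `len` bytes of `value` plus a NUL over the buffer at that
 * offset, the way the action in the question writes into yytext.
 */
void simulate_overwrite(char *input, size_t input_len, size_t token_start,
                        const char *value, size_t len)
{
    /* Stay inside the array; the scanner in the question has no such check. */
    assert(token_start + len + 1 <= input_len);
    memcpy(input + token_start, value, len);
    input[token_start + len] = '\0';
}

/* Returns 1 if a double quote occurs in input[from .. to), else 0. */
int has_quote_between(const char *input, size_t from, size_t to)
{
    return memchr(input + from, '"', to - from) != NULL;
}
```

With the question's input, the closing quote of the first string sits at offset 5, so copying the four characters of this there replaces offsets 5 through 8. That range includes offset 7, which held the opening quote of "apple".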

Instead of overwriting yytext, just make your string_buf_ptr visible to the caller of yylex by either making it a global variable or passing a pointer to a return value as an extra argument to the lexer (see the YY_DECL macro). Of course, that will force you to change your memory management strategy, but your current memory management won't work either: malloc(8) leaves room for only seven characters plus the terminating NUL, and some tokens are likely to be longer than that.
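On the memory-management point, one option is a buffer that grows on demand rather than a fixed malloc(8). The following is a minimal sketch and not code from the original answer; the strbuf type and the strbuf_* function names are made up for illustration, and size overflow is not handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string buffer; all names are illustrative. */
typedef struct {
    char  *data;   /* NUL-terminated contents, or NULL before first use */
    size_t len;    /* number of bytes stored, excluding the NUL */
    size_t cap;    /* allocated size of data in bytes */
} strbuf;

/* Empties the buffer but keeps its allocation for the next token. */
void strbuf_reset(strbuf *b)
{
    assert(b != NULL);
    b->len = 0;
    if (b->data != NULL)
        b->data[0] = '\0';
}

/* Appends n bytes from src. Returns 0 on success, -1 if allocation fails. */
int strbuf_append(strbuf *b, const char *src, size_t n)
{
    assert(b != NULL && src != NULL);
    size_t need = b->len + n + 1;          /* +1 for the terminating NUL */
    if (need > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;
        while (new_cap < need)
            new_cap *= 2;                  /* double until the data fits */
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;                     /* the old buffer is still valid */
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

/* Releases the storage and returns the buffer to its initial state. */
void strbuf_free(strbuf *b)
{
    assert(b != NULL);
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

In the scanner, the opening-quote rule would call strbuf_reset, the rule for runs of ordinary characters would call strbuf_append with yytext and yyleng, and each escape rule would append a single character. Because the buffer is reused between tokens, nothing needs to be freed in the closing-quote action.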

Personally, I'd avoid the global, and keep a static char* which can be passed back to the caller via an out parameter. Then you can require that the caller make a copy of the string if they need to keep it beyond the next call to yylex. You could insist that the caller free the string, but the advantage of the "caller copies" strategy is that no copy will be made if the caller doesn't need to persist the string. This is precisely the strategy used with yytext; yytext will be destroyed by the next call to yylex so a caller needing to persist the token's value needs to make a copy of yytext.
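The "caller copies" contract can be sketched in plain C. The function next_token below is a hand-written stand-in for yylex, written only to show the ownership rule; in a real flex scanner the extra out parameter would come from redefining YY_DECL. The names next_token, keep_copy, TOK_EOF and TOK_STRING are hypothetical, and the stand-in handles no escape sequences.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { TOK_EOF = 0, TOK_STRING = 7 };   /* 7 mirrors ID in the question */

/*
 * Stand-in for yylex(), for illustration only. Reads the next
 * double-quoted string from *cursor into a static buffer and hands that
 * buffer back through *out. The pointer is valid only until the next call.
 * Over-long strings are truncated; a string left open ends at end of input.
 */
int next_token(const char **cursor, const char **out)
{
    static char value[256];     /* owned by the scanner, reused every call */
    size_t n = 0;

    assert(cursor != NULL && *cursor != NULL && out != NULL);
    const char *p = *cursor;

    while (*p != '\0' && *p != '"')     /* skip to the opening quote */
        p++;
    if (*p == '\0') {
        *cursor = p;
        *out = NULL;
        return TOK_EOF;
    }
    p++;                                /* step over the opening quote */
    while (*p != '\0' && *p != '"') {
        if (n + 1 < sizeof value)
            value[n++] = *p;
        p++;
    }
    if (*p == '"')
        p++;                            /* step over the closing quote */
    value[n] = '\0';
    *cursor = p;
    *out = value;
    return TOK_STRING;
}

/* What a caller does when it needs the value beyond the next call. */
char *keep_copy(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;                        /* caller frees; NULL on failure */
}
```

A caller that only prints or compares the token uses the returned pointer directly and pays for no copy. A caller that stores the token, for example in a symbol table, calls keep_copy first, because the next call to next_token reuses the same static buffer.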