Regex clarification on escape sequences with lex

Posted 2019-07-11 18:41

Question:

I'm creating a lexer.l file that is working as intended except for one part. I have the rule:

[\(\*.*\*\)] {}

With this rule, I want text such as (* this is a test *) in the input to be ignored. However, when I run lex lexer.l, I get warnings on the lines with the rules \(, \*, and \), stating that they can never be matched. So my question is: why does [\(\*.*\*\)] {} interfere with \( and the others, and how can I match (* this is a test *)?

Answer 1:

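Square brackets in a (f)lex pattern delimit a character class, so [\(\*.*\*\)] matches exactly one character out of (, *, . and ). Because that rule comes first and matches the same single character as the later rules \(, \* and \), those rules can never be matched.

Languages with the comment syntax (*…*) typically allow nested comments, and nested comments cannot easily be recognized by (f)lex because the nesting requires a context-free grammar, while a lexical scanner only recognizes regular languages.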

If your comments do not nest (so that (* something (* else *) is a comment, rather than the prefix of a longer comment), then you can use the regular expression

[(][*][^*]*[*]+([^*)][^*]*[*]+)*[)]
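To experiment with that expression outside of (f)lex, here is a minimal sketch in C. It assumes a POSIX system that provides <regex.h>, and it assumes POSIX extended syntax treats this pattern the same way (f)lex does. The helper name leading_comment_length is hypothetical, and the leading ^ is added only to anchor the match at the start of the string.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns the length of a non-nested (* ... *) comment
 * at the very start of s, or -1 if s does not start with a complete comment.
 */
static int leading_comment_length(const char *s)
{
    /* Same expression as above, anchored with ^ for this test harness. */
    static const char *pattern = "^[(][*][^*]*[*]+([^*)][^*]*[*]+)*[)]";
    regex_t re;
    regmatch_t m;
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;                       /* pattern failed to compile */
    if (regexec(&re, s, 1, &m, 0) == 0)
        len = (int)(m.rm_eo - m.rm_so);  /* length of the matched comment */
    regfree(&re);
    return len;
}
```

For example, leading_comment_length("(* this is a test *) x") gives 20, and for "(* something (* else *) rest" it gives 23, because the inner (* is treated as ordinary comment text.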

If you do require nested comments, you can use start conditions and a stack (or a simulated stack, as below):

%x SC_COMMENT

%%
  int comment_nesting = 0;

"(*"             { BEGIN(SC_COMMENT); }
<SC_COMMENT>{
  "(*"           { ++comment_nesting; }
  "*"+")"        { if (comment_nesting) --comment_nesting;
                   else BEGIN(INITIAL); }
  "*"+           ; 
  [^(*\n]+       ;
  [(]            ; 
  \n             ; 
}

That snippet was taken from this answer, with a small adjustment because that answer recognizes nested /*…*/ comments. A fuller explanation of the code appears there.
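The counting idea behind those rules can also be tried in plain C without generating a scanner. The following is only a rough sketch of the same logic, not the code from the linked answer: the function name skip_nested_comment is hypothetical, and the depth counter here starts counting at the outermost (*, whereas comment_nesting above counts only the inner levels.

```c
#include <stddef.h>

/*
 * Hypothetical helper: s must point at the "(*" that opens a comment.
 * Returns a pointer just past the matching "*)", or NULL if s does not
 * start with "(*" or the comment is never closed.
 */
static const char *skip_nested_comment(const char *s)
{
    int depth = 0;

    if (s[0] != '(' || s[1] != '*')
        return NULL;

    while (*s != '\0') {
        if (s[0] == '(' && s[1] == '*') {
            ++depth;              /* entering one more comment level */
            s += 2;
        } else if (s[0] == '*' && s[1] == ')') {
            s += 2;
            if (--depth == 0)     /* closed the outermost comment */
                return s;
        } else {
            ++s;                  /* ordinary comment text */
        }
    }
    return NULL;                  /* end of input inside a comment */
}
```

Called on "(* a (* b *) c *) rest", it returns a pointer to " rest"; called on "(* a (* b *)", it returns NULL because the outer comment is still open.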