
Removing nested comments by lex

Published 2019-06-20 13:04

Question:

How should I write a program in lex (or flex) that removes nested comments from text and prints only the text that is not inside comments? I will probably need to recognize a state for being inside a comment and count the number of opening "tags" of block comments.

Let's have these rules:

1. Block comment

/*
block comment
*/

2. Line comment

// line comment

3. Comments can be nested.

Example 1

show /* comment /* comment */ comment */ show

output:

show  show

Example 2

show /* // comment
comment
*/
show

output:

show 
show 

Example 3

show
///* comment
comment
// /*
comment
//*/ comment
//
comment */
show

output:

show
show

Answer 1:

You've got the theory right. Here's a simple implementation; it could be improved.

%x COMMENT
%%
%{
   int comment_nesting = 0;
%}

"/*"            BEGIN(COMMENT); ++comment_nesting;
"//".*          /* // comments to end of line */

<COMMENT>[^*/]+ /* Eat non-comment delimiters */
<COMMENT>"/*"   ++comment_nesting;
<COMMENT>"*/"   if (--comment_nesting == 0) BEGIN(INITIAL);
<COMMENT>[*/]   /* Eat a / or * if it doesn't match comment sequence */

  /* Could have been .|\n ECHO, but matching whole runs at once is more efficient. */
[^/]+|[/]       ECHO;
%%
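
To try this scanner, assuming it is saved as comments.l (a file name chosen here for illustration) and that flex and a C compiler are available: run "flex comments.l", compile with "cc lex.yy.c -o comments -lfl", and pipe the examples above through ./comments. Linking against libfl supplies the default main() and yywrap().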


Answer 2:

This is exactly what you need: yy_push_state(COMMENT). It uses a stack to store the start conditions, which comes in handy in nested situations. (In flex, the start-condition stack and the functions yy_push_state(), yy_pop_state() and yy_top_state() must be enabled with %option stack.)
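
A minimal sketch of that idea, not taken from the original answer: it assumes flex with %option stack for the state stack, and %option main so the generated scanner compiles standalone.

%option stack noyywrap main
%x COMMENT
%%
"/*"             yy_push_state(COMMENT);   /* enter comment state, remembering INITIAL */
"//".*           /* line comment: discard to end of line */

<COMMENT>"/*"    yy_push_state(COMMENT);   /* nested opener: push another level */
<COMMENT>"*/"    yy_pop_state();           /* closer: return to the previous state */
<COMMENT>[^*/]+  /* eat comment text, newlines included */
<COMMENT>[*/]    /* a lone * or / inside the comment */

[^/]+|[/]        ECHO;                     /* everything outside comments is printed */
%%

Functionally this matches the counter-based scanner above; the stack only starts to pay off when further exclusive states (strings, preprocessor lines) have to nest alongside comments.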



Answer 3:

I am afraid that @rici's answer might be wrong. First, we need to record line numbers, since we may have to emit file/line directives later. Second, given an open sign and a close sign, we have the following principles:

1) Use an integer for stack control: increment it for an open sign, decrement it for a close sign.
2) Eat up every character before EOF that is neither a close sign nor an open sign:

<comments>{open}     { no_open_sign++; }
<comments>\n         { curr_lineno++; }
<comments>[^*/\n]+   /* eat characters by doing nothing */

(A character class cannot exclude the multi-character {open} and {close} sequences, so the eat rule excludes the characters that could start them.)

3) Errors might happen when no_open_sign drops to zero, hence:

<comments>{close}    { if (--no_open_sign == 0) BEGIN(INITIAL); }  /* similar to the post above */

4) EOF must not appear inside a comment, hence you need a rule such as:

<comments><<EOF>>    { return ERROR_TOKEN; }

To make it more robust, you also need another close-sign checking rule outside of the <comments> state, to catch an unmatched close sign.

And in practice, you should use negative lookahead and lookbehind in your regular expressions if your lexical analyzer supports them.
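
A minimal flex sketch of those principles, assuming the delimiters are /* and */; no_open_sign and curr_lineno are this answer's placeholder names, and yyterminate() stands in for returning ERROR_TOKEN, which only makes sense inside a real parser.

%option noyywrap main
%x COMMENTS
%%
%{
    int no_open_sign = 0;    /* nesting depth: ++ on an open sign, -- on a close sign */
    int curr_lineno  = 1;    /* recorded so errors can report a line number */
%}

"/*"                no_open_sign = 1; BEGIN(COMMENTS);
"*/"                fprintf(stderr, "line %d: unmatched */\n", curr_lineno);
"//".*              /* line comment */
\n                  ++curr_lineno; ECHO;

<COMMENTS>"/*"      ++no_open_sign;
<COMMENTS>"*/"      if (--no_open_sign == 0) BEGIN(INITIAL);
<COMMENTS>\n        ++curr_lineno;
<COMMENTS>[^*/\n]+  /* eat characters by doing nothing */
<COMMENTS>[*/]      /* a lone * or / inside the comment */
<COMMENTS><<EOF>>   { fprintf(stderr, "line %d: EOF inside comment\n", curr_lineno);
                      BEGIN(INITIAL); yyterminate(); }

[^/\n]+|[/]         ECHO;
%%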