Stripping multiline comments in C++ with RegEx

Posted 2019-09-07 10:15

Question:

Assume I have a variable std::string sourceCode; into which I have loaded a C++ source file. I want to remove all comments using the regex classes from TR1 (they are fully available now, since I use the Microsoft compiler). Single-line comments are easy, but multi-line comments are not. The goal is not just to replace a comment with a space; the number of lines must stay the same. If a removed comment spans 5 lines, the gap should be filled with the same number of line breaks, so that I can trace results back to the original code and work with the correct line numbers.

My code so far:

std::regex singleLinedCommentReg("//.*");
sourceCode = std::regex_replace(sourceCode, singleLinedCommentReg, std::string(""));
std::regex multiLinedCommentReg("(/\\*([^*]|[\r\n]|(\\*+([^*/]|[\r\n])))*\\*+/)");
std::for_each(
    std::sregex_iterator(sourceCode.begin(), sourceCode.end(), multiLinedCommentReg),
    std::sregex_iterator(),
    [&](const std::match_results<std::string::const_iterator>& match) -> bool {
        // TODO: Replace the current match with an appropriate number of newlines.
        return true;
    }
);

Can anyone give me some advice on that?

EDIT #1

I do NOT want to provoke a discussion about whether it makes sense to use regex for this kind of task. Please just assume the input is clean and as expected.

Answer 1:

Your regex approach is way off and too complicated. You are trying to use a regular language (regex) to handle a problem that is at least as complex as a context-free grammar. If you split things up and do part of the processing in C++, you'll get it done, but it will look messy.

If your goal is to write a function that strips out all of the comments without losing the newline characters, I suggest that you generate a lexer using one of the many parsing tools available.

This took less than 5 minutes to create and does what you are looking for. You can modify it to your heart's content. It generates a lexer with flex 2.5.4 or flex 2.5.35.

%{
    #include <stdio.h>
%}


cbeg    "/*"
cend    "*/"
cppc    "//"
nl  "\n"|"\r\n"

%option noyywrap
%x mlc 
%%
{nl}        { fputs(yytext, stdout); }
{cbeg}      { BEGIN(mlc); }
{cend}      { fprintf(stderr, "Error: found end of comment without a beginning\n"); return -1; }
{cppc}.*    /* eat up the comment */
.       { fputs(yytext, stdout); }

<mlc>{cend} { BEGIN(INITIAL); }
<mlc>{cbeg} { fprintf(stderr, "Error: found /* inside another /* comment\n"); return -1; }
<mlc>{nl}   { fputs(yytext, stdout); /* keep line breaks so line numbers stay stable */ }
<mlc>.      /* eat up everything else */

%%

int main(void)
{
        /* yylex() returns 0 at end of input and -1 on the errors reported above */
        return yylex() == 0 ? 0 : 1;
}

Addendum:

The above is a fully functional program. You can generate the .c using:

flex -t foo.l > foo.c

and you can compile it using

cc -o foo foo.c

Now something like

./foo < source.c > source-sans-comments.c 

will generate the new source file.
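
For comparison, the same state logic can be written by hand in C++. The following is only a sketch and not part of the lexer above: the name stripComments is hypothetical, only '\n' is treated as a line break, and unlike the flex rules it does not report a stray "*/" or a "/*" inside a comment. As in the question, it assumes clean input (no comment markers inside string literals).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the lexer's states:
// ordinary code, "//" comment, and "/* ... */" comment.
std::string stripComments(const std::string& source) {
    enum class State { Code, LineComment, BlockComment };
    State state = State::Code;
    std::string result;

    for (std::size_t i = 0; i < source.size(); ++i) {
        const char c = source[i];
        const char next = (i + 1 < source.size()) ? source[i + 1] : '\0';

        if (c == '\n') {
            // Line breaks are always kept; they also end a "//" comment.
            result += c;
            if (state == State::LineComment) state = State::Code;
        } else if (state == State::Code) {
            if (c == '/' && next == '/')      { state = State::LineComment; ++i; }
            else if (c == '/' && next == '*') { state = State::BlockComment; ++i; }
            else                              { result += c; }
        } else if (state == State::BlockComment && c == '*' && next == '/') {
            state = State::Code;
            ++i;  // skip the '/' that closes the comment
        }
        // Any other character inside a comment is dropped.
    }

    // Postcondition: the number of line breaks is unchanged.
    assert(std::count(result.begin(), result.end(), '\n') ==
           std::count(source.begin(), source.end(), '\n'));
    return result;
}
```

Because every '\n' is copied regardless of state, a line number in the output refers to the same line in the input.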



Answer 2:

The best approach would be to use two regexen. The first would remove all single-line comments (these would not affect the line numbers).
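
As a rough illustration of why the first step leaves line numbers alone, here is a minimal sketch. The helper name stripLineComments is made up for this example, and it assumes clean input as stated in the question (no "//" inside string literals).

```cpp
#include <algorithm>
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: removes "//" comments up to, but not including, the line break.
std::string stripLineComments(const std::string& source) {
    // In the default ECMAScript grammar '.' does not match line terminators,
    // so each match stops before the end of its line.
    static const std::regex lineComment("//.*");
    std::string result = std::regex_replace(source, lineComment, "");

    // Postcondition: the number of line breaks is unchanged.
    assert(std::count(result.begin(), result.end(), '\n') ==
           std::count(source.begin(), source.end(), '\n'));
    return result;
}
```

Text before the "//" on the same line, including trailing spaces, is left as it is.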

Then, use another regex for removing the multiline comments, and loop over each one until there are no more:

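#include <algorithm>
#include <regex>
#include <string>

// [\s\S] matches any character, including line breaks; the lazy *? stops at the first "*/".
std::regex mlc("/\\*[\\s\\S]*?\\*/");

std::string data = something;  // placeholder: the source text to process

std::smatch searchresult;

while (std::regex_search(data, searchresult, mlc)) {
    const std::string match = searchresult.str();

    // Count the line breaks inside the comment so they can be put back.
    auto newlinecount = std::count(match.begin(), match.end(), '\n');

    // Replace the whole comment with that many '\n' characters.
    data.replace(searchresult.position(), match.length(), newlinecount, '\n');
}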
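
The loop above searches from the start of the string after every replacement. A possible single-pass variant, shown here only as a sketch, walks the matches with std::sregex_iterator and builds a new string instead. The function name stripBlockComments is hypothetical, and the pattern `/\*[\s\S]*?\*/` is an assumption about what counts as a multi-line comment in clean input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: replaces every /* ... */ comment by the line breaks it contained.
std::string stripBlockComments(const std::string& source) {
    static const std::regex blockComment("/\\*[\\s\\S]*?\\*/");

    std::string result;
    auto copiedUpTo = source.cbegin();  // end of the text already copied

    for (std::sregex_iterator it(source.cbegin(), source.cend(), blockComment), end;
         it != end; ++it) {
        const std::ssub_match& comment = (*it)[0];
        result.append(copiedUpTo, comment.first);  // code before the comment
        const auto newlines = std::count(comment.first, comment.second, '\n');
        result.append(static_cast<std::size_t>(newlines), '\n');
        copiedUpTo = comment.second;
    }
    result.append(copiedUpTo, source.cend());      // code after the last comment

    // Postcondition: the number of line breaks is unchanged.
    assert(std::count(result.begin(), result.end(), '\n') ==
           std::count(source.begin(), source.end(), '\n'));
    return result;
}
```

An unterminated "/*" does not match the pattern, so the text after it is copied unchanged.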