How to remove C-style comments from code

Posted 2020-02-06 17:24

Question:

I just read a new question here on SO asking basically the same thing as mine does in the title. That got me thinking - and searching the web (most hits pointed to SO, of course ;). So I thought -

There should be a simple regex capable of removing C-style comments from any code.

Yes, there are answers to this question/statement on SO, but the ones I found are all incomplete and/or overly complex.

So I started experimenting, and came up with one that works on all types of code I can imagine:

(?:\/\/(?:\\\n|[^\n])*\n)|(?:\/\*(?:\n|\r|.)*?\*\/)|(("|')(?:\\\\|\\\2|\\\n|[^\2])*?\2)

The first alternative matches double-slash // comments. The second matches ordinary /* comment */ ones. The third handles the case I had trouble finding in other regexes written for this task: strings containing character sequences that would be considered comments outside a string.

This part captures the whole string literal in capture group one. The opening quote character is captured in group two, and the match runs up to the matching closing quote at the end of the string.

Capture group one should be kept in the replacement and everything else discarded (replaced with ""), leaving un-commented code :).
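As a minimal sketch of that replace step, here is the regex above applied with Python's re module. The choice of Python, the name strip_comments and the sample input are assumptions made for illustration; the post itself only links a regex101 example. A callback is used so that a match with no group one (a comment) is replaced by an empty string.

```python
import re

# The regex from the question, copied verbatim. Group 1 captures string
# literals; comments match one of the first two alternatives and leave
# group 1 unset.
COMMENT_OR_STRING = re.compile(
    r"""(?:\/\/(?:\\\n|[^\n])*\n)|(?:\/\*(?:\n|\r|.)*?\*\/)|(("|')(?:\\\\|\\\2|\\\n|[^\2])*?\2)"""
)


def strip_comments(source: str) -> str:
    """Keep group 1 (string literals) and drop every other match."""
    return COMMENT_OR_STRING.sub(lambda m: m.group(1) or "", source)


# Hypothetical input: one string literal that must survive, one comment
# of each kind that must go.
sample = 'char s[] = "not // a /* comment */"; /* gone */\nint x; // gone too\n'
cleaned = strip_comments(sample)
```

Note that the // alternative also consumes the newline that ends the comment, so the line following a line comment is joined onto the current one.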

Here's a C example at regex101.

OK... So that's not a question. It's an answer you think...

Yes, you're right. So... on to the question.

Have I missed any type of code that this regex would miss?

It handles

multi line comments

/*
    an easy one
*/

"end of line" comments

// Remove this

comments in strings

char array[] = "Following isn't a comment // because it's in a string /* this neither */";

which leads to - strings with escaped quotes

    char array[] = "Handle /* comments */ - // - in strings with \" escaped quotes";

and strings with escaped escapes

    char array[] = "Handle strings with **not** escaped quotes\\"; // <-EOS

JavaScript single-quoted string

var myStr = 'Should also ignore enclosed // comments /* like these */ ';

line continuation

// This is a single line comment \
continuing on the next row (warns, but works in my C++ flavor)

So, can you think of any code cases messing this up? If you come up with any I'll try to complete the RE and hopefully it'll end up complete ;)

Regards.

PS. I know... Writing this it says in the right pane, under How to Ask: We prefer questions that can be answered, not just discussed. This question might violate that :S but I can't resist.

In fact, it may even turn out to be an answer, instead of a question, to some people. (Too cocky? ;)

Answer 1:

I've considered the comments (so far) and changed the regex to:

(?:\/\/(?:\\\n|[^\n])*\n)|(?:\/\*[\s\S]*?\*\/)|((?:R"([^(\\\s]{0,16})\([^)]*\)\2")|(?:@"[^"]*?")|(?:"(?:\?\?'|\\\\|\\"|\\\n|[^"])*?")|(?:'(?:\\\\|\\'|\\\n|[^'])*?'))

It handles the C++11 raw string literals Biffen pointed out (as well as C# verbatim strings), and it's changed according to Wiktor's suggestions.
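A rough sketch of how the updated regex treats those two string forms, again assuming Python's re module as the engine. The function name and both sample lines are made up for illustration.

```python
import re

# The updated regex from this answer, copied verbatim.
UPDATED = re.compile(
    r"""(?:\/\/(?:\\\n|[^\n])*\n)|(?:\/\*[\s\S]*?\*\/)|((?:R"([^(\\\s]{0,16})\([^)]*\)\2")|(?:@"[^"]*?")|(?:"(?:\?\?'|\\\\|\\"|\\\n|[^"])*?")|(?:'(?:\\\\|\\'|\\\n|[^'])*?'))"""
)


def strip_comments_updated(source: str) -> str:
    """Keep group 1 (any supported string literal), drop the comments."""
    return UPDATED.sub(lambda m: m.group(1) or "", source)


# Hypothetical C++11 raw string whose body looks like a line comment.
cpp_raw = 'auto s = R"x(a // b)x"; // c\n'

# Hypothetical C# verbatim string ending in a backslash. In a verbatim
# string the backslash does not escape the closing quote.
csharp_verbatim = 'var p = @"C:\\dir\\"; // path\n'

cpp_cleaned = strip_comments_updated(cpp_raw)
csharp_cleaned = strip_comments_updated(csharp_verbatim)
```

In both samples the string literal is kept whole and only the trailing // comment (with its newline) is removed.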

I split it to handle single and double quotes separately because their logic differs (and to avoid the non-working back reference ;).
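To illustrate why the back reference inside the character class did not work, here is a small check in Python's re module. It is a simplified pattern using group 1 rather than the original group 2, and the behavior shown is Python's; other regex flavors are assumed, not verified, to treat it similarly.

```python
import re

# Inside [...], \1 is not a back reference. Python's re module reads it as
# an octal escape (the character U+0001), so this class still accepts the
# quote that group 1 captured.
class_with_escape = re.fullmatch(r"""(["'])[^\1]""", '""')

# Spelling the quote out in the class does exclude it, which is what the
# separate single-quote and double-quote alternatives rely on.
class_with_literal = re.fullmatch(r'"[^"]', '""')

print(class_with_escape is not None)  # True: the quote slipped through
print(class_with_literal is None)     # True: the quote was rejected
```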

It's undoubtedly more complex, but still far from the solutions I've seen out there which hardly cover any of the string issues. And it could be stripped of parts not applicable to a specific language.

One comment suggested supporting more languages. That would make the RE (even more) complex and unmanageable. It should be relatively easy to adapt though.

Updated regex101 example.

Thanks everyone for the input so far. And keep the suggestions coming.

Regards

Edit: Updated the raw string handling - this time I actually read the spec. ;)