So I need a regular expression for finding single line and multi line comments, but not in a string (e.g. "my /* string").
For testing (# single line, /* & */ multi line):
# complete line should be found
lorem ipsum # from this to line end
/*
all three lines should be found
*/ but not here anymore
var x = "this # should not be found"
var y = "this /* shouldn't */ match either"
var z = "but" & /* this must match */ "_"
SO does the syntax display really well; I basically want all the gray text.
I don't care if it's a single regex or two separate ones. ;)
EDIT: One more thing: the opposite would also satisfy me, i.e. searching for a string which is not in a comment.
This is my current string matching: "[\s\S]*?(?<!\\)"
(Indeed, it will not work with "\\".)
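To make the "\\" problem concrete, here is a small Python sketch. It runs the pattern from the question against a string literal that ends in an escaped backslash, and compares it with a commonly used escape-aware pattern. The second pattern and the names QUESTION_PATTERN, ESCAPE_AWARE and sample are assumptions added for illustration; they are not part of the question.

```python
import re

# Pattern from the question: lazy body, and the closing quote must not
# directly follow a backslash.
QUESTION_PATTERN = re.compile(r'"[\s\S]*?(?<!\\)"')

# Assumed alternative (not from the question): each step consumes either one
# ordinary character or a backslash together with the character it escapes.
ESCAPE_AWARE = re.compile(r'"(?:[^"\\]|\\.)*"', re.DOTALL)

# Two string literals: "C:\\" (ends in an escaped backslash) and "x".
sample = r'a = "C:\\" + "x"'

# The lookbehind sees a backslash before the real closing quote, so the
# match runs on into the opening quote of the next literal.
print(QUESTION_PATTERN.findall(sample))

# The escape-aware pattern pairs the two backslashes and stops correctly.
print(ESCAPE_AWARE.findall(sample))
```

With this input, the first pattern returns a single over-long match that swallows the `+` between the two literals, while the second returns the two literals separately.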
EDIT2:
OK finally I wrote my own comment parser -.-
And if someone else is interested in the source code, grab it from here: https://github.com/relikd/CommentParser
Here's one possibility (it does have an Achilles heel that I'll get to):
(#[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n]|/\*([^*]|\*(?!/))*?\*/)(?=[^"]*(?:"[^"]*"[^"]*)*$)
In action here
With the GLOBAL and DOTALL flags, but not the MULTILINE flag.
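For anyone who wants to try this outside an online tester, here is a minimal sketch in Python, assuming its `re` module behaves like the flavor used above for this pattern. `finditer` stands in for the GLOBAL flag, and `re.DOTALL` is passed only to mirror the stated flags. The names COMMENT, TEST_TEXT and found are made up for the example.

```python
import re

# The pattern from the answer, unchanged. re.MULTILINE is deliberately left
# off so that $ anchors at the end of the whole text, not at each line.
COMMENT = re.compile(
    r'(#[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n]|/\*([^*]|\*(?!/))*?\*/)'
    r'(?=[^"]*(?:"[^"]*"[^"]*)*$)',
    re.DOTALL,
)

# The test text from the question.
TEST_TEXT = '''# complete line should be found
lorem ipsum # from this to line end
/*
all three lines should be found
*/ but not here anymore
var x = "this # should not be found"
var y = "this /* shouldn't */ match either"
var z = "but" & /* this must match */ "_"
'''

# Group 1 is the whole comment; the lookahead itself consumes nothing.
found = [m.group(1) for m in COMMENT.finditer(TEST_TEXT)]
for comment in found:
    print(repr(comment))
```

On the question's test text this finds the two `#` comments and the two `/* */` comments, and skips the markers inside the strings. Note that each `#` match includes the trailing line break, because the pattern consumes `[\r\n]`.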
Explanation of the regex:
(
  #[^"\n\r]*                    Hash mark followed by non-" and non-end-of-line
  (?:"[^"\n\r]*"[^"\n\r]*)*     If any quotes in the comment, they must be balanced
  [\r\n]                        Followed by end-of-line ($ except we don't have multiline flag)
|                               OR
  /\*([^*]|\*(?!/))*?\*/        /* xxx */ sort of comment
)                               BOTH FOLLOWED BY
(?=[^"]*(?:"[^"]*"[^"]*)*$)     only a *balanced* number of quotes for the *rest of the code :O!*
However, this relies on balanced quotes being used throughout the text (it also doesn't take into account escaped quotes, but it's easy enough to modify the regex to take that into account).
If a user has a comment with a " in it that isn't balanced...boom. You're screwed!
Regex is generally not recommended for things like HTML/code parsing, but if you can rely on the fact that quotes have to balance when you define a string, etc., you can sometimes get away with it.
Since you are also parsing comments, which have no set structure (i.e. you are not guaranteed that quotes within comments will be balanced), you won't be able to find a regex solution that works here.
Anything you think up can be outwitted by an unbalanced quote in a comment somewhere (say the comment was # remove all the " marks), or by multiline strings (where on a given line there may be unbalanced quotes).
Bottom line - you can probably make a regex that will work in most cases, but not for all. To get something watertight you'll have to write some code.
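As a rough idea of what "some code" could look like, here is a minimal single-pass scanner in Python. It is only a sketch under stated assumptions: strings are double-quoted and use backslash escapes, lines end in "\n", and the function name find_comments is invented for this example. It is not the parser linked in the question.

```python
def find_comments(source):
    """Return every comment in source, ignoring markers inside strings."""
    comments = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == '"':
            # Skip a string literal; a backslash escapes the next character
            # (assumed escape rule, adjust for the real language).
            i += 1
            while i < n and source[i] != '"':
                i += 2 if source[i] == "\\" else 1
            i += 1  # step over the closing quote
        elif ch == "#":
            # Single-line comment: runs up to, not including, the line break.
            end = source.find("\n", i)
            end = n if end == -1 else end
            comments.append(source[i:end])
            i = end
        elif source.startswith("/*", i):
            # Multi-line comment: runs through the first closing "*/".
            end = source.find("*/", i + 2)
            end = n if end == -1 else end + 2
            comments.append(source[i:end])
            i = end
        else:
            i += 1
    return comments


sample = (
    '# remove all the " marks\n'
    'var x = "this # should not be found"\n'
    'var z = "but" & /* this must match */ "_"\n'
)
print(find_comments(sample))
```

Because the scanner always knows whether it is inside a string or a comment, the unbalanced quote in the first comment does not affect how the following lines are read.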
I would use two regular expressions for this:
/(\/\*[\s\S]*?\*\/)|(#.*?$)/m
to find all the comments; the "m" modifier makes $ match at the end of every line, and [\s\S] lets a /* */ comment span several lines
/"[^"]*?"/
to find all the strings
If you apply the highlighting to the comments first and only afterwards to the strings, the invalid comments should disappear.
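One way to read "comments first, then strings" is as painting: label every character a comment match covers, let the string pass paint over it, and keep only the comments that were left untouched. The Python sketch below follows that reading. The patterns are adapted to Python's `re` with the closing `*/` spelled out, and the names COMMENT, STRING and comments_by_painting are assumptions for this example.

```python
import re

# Adapted versions of the two patterns (assumption: Python's re flavor).
COMMENT = re.compile(r'/\*[\s\S]*?\*/|#[^\n]*')
STRING = re.compile(r'"[^"]*"')


def comments_by_painting(source):
    """Paint comments, then strings on top, and keep the untouched comments."""
    labels = [None] * len(source)
    for kind, pattern in (("comment", COMMENT), ("string", STRING)):
        for m in pattern.finditer(source):
            labels[m.start():m.end()] = [kind] * (m.end() - m.start())
    # A comment survives only if the string pass painted over none of it.
    return [
        m.group(0)
        for m in COMMENT.finditer(source)
        if all(label == "comment" for label in labels[m.start():m.end()])
    ]


sample = (
    'lorem ipsum # from this to line end\n'
    'var x = "this # should not be found"\n'
    'var y = "this /* shouldn\'t */ match either"\n'
    'var z = "but" & /* this must match */ "_"\n'
)
print(comments_by_painting(sample))
```

This works on the question's test lines, but it keeps the weakness of matching strings over the whole text: a real comment that contains a quote can throw off the pairing of quotes in the string pass.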