I am trying to remove all comments and empty lines from a file with a macro. I came up with the solution below, which deletes the comments (with a bug described further down) but does not delete the blank lines in between:
Sub CleanCode()
    Dim regexComment As String = "(REM [\d\D]*?[\r\n])|(?<SL>\'[\d\D]*?[\r\n])"
    Dim regexBlank As String = "^[\s|\t]*$\n"
    Dim replace As String = ""
    Dim selection As EnvDTE.TextSelection = DTE.ActiveDocument.Selection
    Dim editPoint As EnvDTE.EditPoint
    selection.StartOfDocument()
    selection.EndOfDocument(True)
    DTE.UndoContext.Open("Custom regex replace")
    Try
        Dim content As String = selection.Text
        Dim resultComment As String = System.Text.RegularExpressions.Regex.Replace(content, regexComment, replace)
        Dim resultBlank As String = System.Text.RegularExpressions.Regex.Replace(resultComment, regexBlank, replace)
        selection.Delete()
        selection.Collapse()
        Dim ed As EditPoint = selection.TopPoint.CreateEditPoint()
        ed.Insert(resultBlank)
    Catch ex As Exception
        DTE.StatusBar.Text = "Regex Find/Replace could not complete"
    Finally
        DTE.UndoContext.Close()
        DTE.StatusBar.Text = "Regex Find/Replace complete"
    End Try
End Sub
Here is what the file should look like before and after running the macro.
BEFORE
Public Class Class1
    Public Sub New()
        ''asdasdas
        Dim a As String = "" ''asdasd
        ''' asd ad asd
    End Sub

    Public Sub New(ByVal strg As String)
        Dim a As String = ""
    End Sub
End Class
AFTER
Public Class Class1
    Public Sub New()
        Dim a As String = ""
    End Sub
    Public Sub New(ByVal strg As String)
        Dim a As String = ""
    End Sub
End Class
There are two main problems with the macro:
- It cannot delete the blank lines in between.
- If there is a line of code like this:
    Dim a as String = "Name='Soham'"
  then after running the macro it becomes:
    Dim a as String = "Name='"
To get rid of a line that contains whitespace or nothing, you can use this regex:
Your regex, ^[\s|\t]*$\n, would work if you specified Multiline mode ((?m)), but it's still incorrect. For one thing, the | matches a literal |; there's no need to specify "or" in a character class. For another, \s matches any whitespace character, including TAB (\t), carriage-return (\r), and linefeed (\n), which makes the explicit \t redundant and the whole pattern needlessly inefficient. For example, at the first blank line (after the end of the first Sub), the ^[\s|\t]* will initially try to match everything before the word Public, then it will back off to the end of the previous line, where the $\n can match.

But a blank line, in addition to being empty or containing only horizontal whitespace (spaces or TABs), may also contain a comment. I choose to treat these "comment-only" lines as blank lines because it's relatively easy to do, and it simplifies the task of matching comments in non-blank lines, which is much harder. Here's my regex:
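As a rough sketch of a pattern with this shape, here is a version written for Python's re module rather than .NET. It is an illustration under assumptions, not necessarily the exact expression intended above: the names BLANK_OR_COMMENT_LINE and drop_blank_and_comment_lines are made up for the example, and the \b after REM is an added guard so that identifiers such as REMOVE are left alone.

```python
import re

# Hypothetical name: matches a whole line that is empty, holds only spaces/TABs,
# or holds only a comment, together with the line separator(s) that end it.
BLANK_OR_COMMENT_LINE = re.compile(
    r"^[ \t]*"                    # leading horizontal whitespace only
    r"(?:(?:REM\b|')[^\r\n]*)?"   # optional comment running to the end of the line
    r"[\r\n]+",                   # the separator is the only required part
    re.MULTILINE,
)


def drop_blank_and_comment_lines(source: str) -> str:
    """Remove blank and comment-only lines; trailing comments are left alone."""
    return BLANK_OR_COMMENT_LINE.sub("", source)


sample = (
    "Public Sub New()\r\n"
    "    ''asdasdas\r\n"
    "    Dim a As String = \"\" ''asdasd\r\n"
    "\r\n"
    "  \t\r\n"
    "REM old\r\n"
    "End Sub\r\n"
)
print(drop_blank_and_comment_lines(sample))
```

In this sketch only the Public Sub, Dim and End Sub lines survive; the comment that trails the Dim statement is untouched, because this pass deals with whole lines only.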
After consuming any leading horizontal whitespace, if I see a REM or ' signifying a comment, I consume that and everything after it until the next line separator. Notice that the only thing that's required to be present is the line separator itself. Also notice the absence of the end anchor, $. It's never necessary to use that when you're explicitly matching the line separators, and in this case it would break the regex. In Multiline mode, $ matches only before a linefeed (\n), not before a carriage-return (\r). (This behavior of the .NET flavor is incorrect and rather surprising, given Microsoft's longstanding preference for \r\n as a line separator.)

Matching the remaining comments is a fundamentally different task. As you've discovered, simply searching for REM or ' is no good because you might find it in a string literal, where it does not signify the start of a comment. What you have to do is start from the beginning of the line, consuming and capturing anything that's not the beginning of a comment or a string literal. If you find a double-quote, go ahead and consume the string literal. If you find a REM or ', stop capturing and go ahead and consume the rest of the line. Then you replace the whole line with just the captured portion, i.e., everything before the comment. Here's the regex:

Or, more readably:
The replacement string would be "${line}". Some other notes:
- I'm not using [\r\n]+ to consume the line separator, like the "blank lines" regex does.
- I don't use $ either, for the same reason as before. The [^\r\n]* will greedily consume everything before the line separator, so the anchor isn't needed.
- The match requires a REM or '; we don't bother matching any line that doesn't contain a comment.
- I turned on explicit-capture mode so I could write (...) instead of (?:...) for all the groups I don't want to capture, but the named group, (?<line>...), still works.

I don't do VB, but here's a demo in C#.
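The approach described in this answer (capture everything up to the comment, stepping over string literals, then keep only the captured part) can be sketched in Python as follows. This is a hedged illustration, not the original .NET pattern: the names TRAILING_COMMENT and strip_trailing_comments are hypothetical, Python spells the named group (?P<line>...) instead of (?<line>...), and treating any standalone word REM as a comment start is a simplifying assumption.

```python
import re

STRING_LITERAL = r'"[^"\r\n]*"'            # a complete double-quoted literal
COMMENT_START = r"(?:'|\bREM\b)"           # apostrophe or the word REM (assumption)
# One ordinary character: not a quote, not a line separator, not a comment start.
CODE_CHAR = r"(?!" + COMMENT_START + r')[^"\r\n]'

TRAILING_COMMENT = re.compile(
    r"^(?P<line>(?:" + STRING_LITERAL + r"|" + CODE_CHAR + r")*)"  # code to keep
    + COMMENT_START + r"[^\r\n]*",                                 # comment to drop
    re.MULTILINE | re.IGNORECASE,
)


def strip_trailing_comments(source: str) -> str:
    """Replace each commented line with the code that precedes the comment."""
    # rstrip() also drops the spaces that separated the code from the comment.
    return TRAILING_COMMENT.sub(lambda m: m.group("line").rstrip(), source)


sample = (
    "Dim a As String = \"Name='Soham'\"\n"
    "Dim b As String = \"\" ''asdasd\n"
    "x = 1 REM note\n"
)
print(strip_trailing_comments(sample))
```

The first sample line is left alone because its apostrophes sit inside a string literal, which the pattern consumes as a unit. A comment-only line is reduced to an empty line by this sketch, so a blank-line pass is still needed before or after it.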
Delete the comments first using this regex:
'+\s*(\W|\w).+
'+ - one or more ' for the beginning of each comment.
\s* - if there are spaces after the comment.
(\W|\w).+ - anything that follows except for line terminators.
Then remove the blank lines left using the regex Mr. Alan Moore provided.
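To see what this pattern does, it can be tried in Python, whose re module accepts the same syntax for this particular expression. This is a quick check only, and the variable names are made up for the example. It strips a trailing comment as intended but, like the pattern in the question, it also fires on an apostrophe inside a string literal.

```python
import re

COMMENT = re.compile(r"'+\s*(\W|\w).+")

# Hypothetical sample lines taken from the question's examples.
with_comment = "Dim a As String = \"\" ''asdasd"
with_literal = "Dim a as String = \"Name='Soham'\""

cleaned_comment = COMMENT.sub("", with_comment)   # comment removed, trailing space kept
cleaned_literal = COMMENT.sub("", with_literal)   # string literal is cut short

print(repr(cleaned_comment))
print(repr(cleaned_literal))
```

The second result loses everything from the first apostrophe onward, so this regex is only safe on code that has no apostrophes inside string literals.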
I've just checked with the two examples from above; '+{.+}$ should do. Optionally, you could go with ('|'')+{.+}$, but the first solution also replaces the XML descriptions.

Edit: if you use ('+{.+}$|^$\n), it deletes a) all comments and b) all empty lines. However, if a comment is followed by an End Sub/Function, that line gets pulled up by one line, which results in a compiler error.

Before
After

Edit: To delete any empty lines, search for the regex ^$\n and replace it with an empty string.
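The ^$\n search is presumably meant for the editor's Find/Replace dialog, whose regex dialect may behave differently. As a rough analogue only, here is the same pattern in Python with Multiline mode switched on explicitly (the helper name remove_empty_lines is made up for the example). It suggests that the pattern removes truly empty lines, but leaves whitespace-only lines and, in Python at least, empty lines that end in \r\n.

```python
import re

EMPTY_LINE = re.compile(r"^$\n", re.MULTILINE)


def remove_empty_lines(source: str) -> str:
    """Delete lines that contain nothing at all before a linefeed."""
    return EMPTY_LINE.sub("", source)


print(repr(remove_empty_lines("a\n\nb\n")))        # empty line removed
print(repr(remove_empty_lines("a\n  \nb\n")))      # whitespace-only line kept
print(repr(remove_empty_lines("a\r\n\r\nb\r\n")))  # CRLF empty line kept
```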