Commenting Regular Expressions

I'm trying to comment regular expressions in JavaScript.

There seem to be many resources on how to remove comments from code using regex, but none on how to actually comment regular expressions in JavaScript so they are easier to understand.

Any help is greatly appreciated!

Answer 1:

Unfortunately, JavaScript doesn't have a verbose mode for regular expression literals like some other languages do. You may find this interesting, though.

In lieu of any external libraries, your best bet is just to use a normal string and comment that:

var r = new RegExp(
    '('      + //start capture
    '[0-9]+' + // match one or more digits
    ')'        //end capture
); 
r.test('9'); //true
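
One caveat with the plain-string approach is that every backslash has to be doubled ('\\d' rather than '\d'). A possible variation, sketched here, writes each fragment as a small regex literal and joins their .source strings, so the fragments keep normal regex escaping. The names dateParts and dateRegex and the date pattern itself are illustrative assumptions, not part of the answer above.

```javascript
// Sketch: build a commented pattern from regex-literal fragments.
// Each fragment's .source is its pattern text without the slashes,
// so backslashes do not need to be doubled.
const dateParts = [
  /^/,          // start of input
  /(\d{4})/,    // capture 1: four-digit year
  /-/,          // literal separator
  /(\d{2})/,    // capture 2: two-digit month
  /-/,          // literal separator
  /(\d{2})/,    // capture 3: two-digit day
  /$/           // end of input
];

const dateRegex = new RegExp(dateParts.map((part) => part.source).join(''));

const dateMatch = dateRegex.exec('2024-01-31');
console.log(dateMatch.slice(1)); // ['2024', '01', '31']
```

The trade-off is that each fragment must be a valid regular expression on its own, so a group cannot be opened in one fragment and closed in another, as the string version does with '(' and ')'.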

Answer 2:

In several other languages (notably Perl), there's the special x flag. When set, the regexp ignores any whitespace and comments inside of it. Sadly, JavaScript regexps do not support the x flag.

Lacking syntax support, the only way to improve readability is convention. Mine is to add a comment before the tricky regular expression that spells it out as if the x flag were available. Example:

/*
  \+?     #optional + sign
  (\d*)   #the integer part
  (       #begin decimal portion
     \.
     \d+  #decimal part
  )
 */
var re = /\+?(\d*)(\.\d+)/;
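
Because the comment is not checked by the engine, it can drift away from the real pattern. One way to guard against that is to run the expression against a few inputs. The sample strings below are assumptions chosen only for illustration:

```javascript
// Same pattern as above: optional '+', integer part, required decimal part.
const re = /\+?(\d*)(\.\d+)/;

const full = re.exec('+12.5');
console.log(full[1], full[2]); // 12 .5

const noInteger = re.exec('.75');
console.log(JSON.stringify(noInteger[1]), noInteger[2]); // "" .75

console.log(re.exec('42')); // null, because the decimal portion is not optional
```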

For more complex examples, you can see what I've done with the technique here and here.

Answer 3:

While JavaScript doesn't natively support multi-line, commented regular expressions, it's easy enough to construct something that accomplishes the same thing: use a function that takes in a (multi-line, commented) string and returns a regular expression built from that string, sans comments and newlines.

The following snippet imitates the behavior of other flavors' x ("extended") flag, which ignores all whitespace characters in a pattern as well as comments, which are denoted with #:

function makeExtendedRegExp(inputPatternStr, flags) {
  // Remove everything between the first unescaped `#` and the end of a line
  // and then remove all unescaped whitespace
  const cleanedPatternStr = inputPatternStr
    .replace(/(^|[^\\])#.*/g, '$1')
    .replace(/(^|[^\\])\s+/g, '$1');
  return new RegExp(cleanedPatternStr, flags);
}


// The following switches the first word with the second word:
const input = 'foo bar baz';
const pattern = makeExtendedRegExp(String.raw`
  ^       # match the beginning of the line
  (\w+)   # 1st capture group: match one or more word characters
  \s      # match a whitespace character
  (\w+)   # 2nd capture group: match one or more word characters
`);
console.log(input.replace(pattern, '$2 $1'));

Ordinarily, to represent a backslash in a JavaScript string, one must escape each literal backslash with a second one, e.g. str = 'abc\\def'. But regular expressions often use many backslashes, and the doubled backslashes can make the pattern much less readable. When writing a JavaScript string with many backslashes, it's a good idea to use a String.raw template literal, in which a single typed backslash represents a literal backslash without additional escaping.
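
As a small illustration of that difference, the two strings below hold the same pattern text. The variable names and the decimal-number pattern are assumptions made for this example:

```javascript
// Both lines describe the same pattern text: \d+\.\d+
const escaped = '\\d+\\.\\d+';       // ordinary string: each backslash is doubled
const raw = String.raw`\d+\.\d+`;    // raw template literal: backslashes typed once

console.log(escaped === raw);               // true
console.log(new RegExp(raw).test('3.14'));  // true
```

Keep in mind that inside a template literal a backtick still ends the string and ${ still starts an interpolation, so patterns containing those need extra care.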

Just like with the standard x modifier, to match an actual # in the string, escape it first, e.g.:

foo\#bar     # comments go here

// this function is exactly the same as the one in the first snippet

function makeExtendedRegExp(inputPatternStr, flags) {
  // Remove everything between the first unescaped `#` and the end of a line
  // and then remove all unescaped whitespace
  const cleanedPatternStr = inputPatternStr
    .replace(/(^|[^\\])#.*/g, '$1')
    .replace(/(^|[^\\])\s+/g, '$1');
  return new RegExp(cleanedPatternStr, flags);
}


// The following switches the first word with the second word:
const input = 'foo#bar baz';
const pattern = makeExtendedRegExp(String.raw`
  ^       # match the beginning of the line
  (\w+)   # 1st capture group: match one or more word characters
  \#      # match a hash character
  (\w+)   # 2nd capture group: match one or more word characters
`);
console.log(input.replace(pattern, '$2 $1'));

Note that to match a literal space character (and not just any whitespace character) while using the x flag in any environment (including the above), you have to escape the space with a \ first, e.g.:

^(\S+)\ (\S+)   # capture the first two words
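
Here is a minimal sketch of that escaped space run through the helper from the first snippet. The variable name twoWords and the sample input are assumptions for illustration:

```javascript
// Same helper as in the first snippet above.
function makeExtendedRegExp(inputPatternStr, flags) {
  const cleanedPatternStr = inputPatternStr
    .replace(/(^|[^\\])#.*/g, '$1')
    .replace(/(^|[^\\])\s+/g, '$1');
  return new RegExp(cleanedPatternStr, flags);
}

const twoWords = makeExtendedRegExp(String.raw`
  ^(\S+)   # first word
  \        # an escaped literal space (backslash followed by a space)
  (\S+)    # second word
`);

// The space directly after the backslash survives the whitespace stripping.
console.log(twoWords.source);                          // ^(\S+)\ (\S+)
console.log('foo bar baz'.replace(twoWords, '$2 $1')); // bar foo baz
```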

If you want to frequently match space characters, this can get a bit tedious and make the pattern harder to read, similar to how double-escaping backslashes isn't very desirable. One possible (non-standard) modification to permit unescaped space characters would be to only strip out spaces at the beginning and end of a line, and spaces before a # comment:

function makeExtendedRegExp(inputPatternStr, flags) {
  // Remove the first unescaped `#`, any preceding unescaped spaces, and everything that follows
  // and then remove leading and trailing whitespace on each line, including linebreaks
  const cleanedPatternStr = inputPatternStr
    .replace(/(^|[^\\]) *#.*/g, '$1')
    .replace(/^\s+|\s+$|\n/gm, '');
  console.log(cleanedPatternStr);
  return new RegExp(cleanedPatternStr, flags);
}


// The following switches the first word with the second word:
const input = 'foo bar baz';
const pattern = makeExtendedRegExp(String.raw`
  ^             # match the beginning of the line
  (\w+) (\w+)   # capture the first two words
`);
console.log(input.replace(pattern, '$2 $1'));

Answer 4:

I would suggest putting a regular comment above the line with the regular expression in order to explain it.

You will have much more freedom.
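
For instance, something along these lines. The hex-colour pattern and the name hexColor are assumptions picked only to illustrate the idea:

```javascript
// Matches a CSS-style hex colour such as "#fff" or "#1A2B3C":
//   ^#             leading hash at the start of the string
//   [0-9a-f]{3}    short form, three hex digits
//   [0-9a-f]{6}    long form, six hex digits
//   $              nothing may follow
// The i flag makes the hex digits case-insensitive.
const hexColor = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;

console.log(hexColor.test('#fff'));    // true
console.log(hexColor.test('#1A2B3C')); // true
console.log(hexColor.test('#ffff'));   // false
```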