I'm a regular expression newbie, and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern:
/(\b\S+)(?:\s+\1\b)+/
(Pattern Demo)Replace:
$1
(replaces the fullstring match with capture group #1)This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b
(word boundary) characters are vital to ensure partial words are not matched.+
(one or more quantifier) on the non-capturing group is more appropriate than*
because*
will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in Javascript: The Good Parts can be adapted to do this:
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
Here is one that catches multiple words multiple times:
I believe this regex handles more situations:
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
Use this in case you want case-insensitive checking for duplicate words.
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.