Regular Expression For Consecutive Duplicate Words

Published 2019-01-01 05:53

伤终究还是伤i

Question:

I'm a regular expression newbie, and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:

Paris in the the spring.

Not that that is related.

Why are you laughing? Are my my regular expressions THAT bad??

Is there a single regular expression that will match ALL of the duplicated words above (the the, that that, my my)?

Answer 1:

Try this regular expression:

\b(\w+)\s+\1\b

Here \b is a word boundary and \1 references the captured match of the first group.
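As a quick check, here is a minimal JavaScript sketch that runs this pattern over the three sentences from the question. The helper name findDoubledWord is made up for illustration and is not part of the original answer.

```javascript
// Pattern from the answer above: a word, whitespace, then the same word again.
const doubledWord = /\b(\w+)\s+\1\b/;

// Hypothetical helper: returns the first doubled-word match, or null if none.
function findDoubledWord(sentence) {
  const m = doubledWord.exec(sentence);
  return m ? m[0] : null;
}

console.log(findDoubledWord("Paris in the the spring."));  // "the the"
console.log(findDoubledWord("Not that that is related.")); // "that that"
console.log(findDoubledWord("Why are you laughing? Are my my regular expressions THAT bad??")); // "my my"
```

The match is case-sensitive and stops at the first doubled word it finds.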

Answer 2:

I believe this regex handles more situations:

/(\b\S+\b)\s+\b\1\b/

A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
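One situation where the two patterns differ is a word containing an apostrophe. The sketch below compares this pattern with the \w-based one from the first answer; the sample sentence and variable names are my own, chosen only for illustration.

```javascript
// Pattern from this answer: \S+ lets a "word" contain apostrophes or hyphens.
const nonSpaceDoubled = /(\b\S+\b)\s+\b\1\b/;
// The \w-based pattern from the first answer, for comparison.
const wordDoubled = /\b(\w+)\s+\1\b/;

// Hypothetical test sentence.
const sample = "It was joe's joe's idea.";

console.log(nonSpaceDoubled.test(sample));    // true
console.log(nonSpaceDoubled.exec(sample)[0]); // "joe's joe's"
console.log(wordDoubled.test(sample));        // false
```

The \w-based pattern fails here because \w+ cannot cross the apostrophe, so it never sees "joe's" as one word.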

Answer 3:

The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):

(\b\w+\b)\W+\1
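To see what the \W+ separator changes, here is a small sketch. It runs in JavaScript rather than PCRE, on the assumption that both engines treat this particular pattern the same way; the sample strings are made up.

```javascript
// Pattern from this answer; \W+ accepts any run of non-word characters
// (spaces, commas, and so on) between the two copies of the word.
const acrossPunctuation = /(\b\w+\b)\W+\1/;

console.log(acrossPunctuation.test("Not that, that is related.")); // true

// There is no trailing \b, so the second copy may be the start of a longer word.
console.log(acrossPunctuation.exec("the theory")[0]); // "the the"
```

Appending \b to the pattern would avoid the second kind of match, if that behavior is not wanted.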

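Answer 4:

Try the regular expression below. Its parts are:

\b  word boundary at the start of the word
\W+ one or more non-word characters
\1  the same word that was already matched
\b  word boundary at the end of the word

()* repeats the group zero or more times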

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateWords {

    public static void main(String[] args) {

        // A word, then zero or more repeats of it separated by non-word characters.
        String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
        // CASE_INSENSITIVE lets "Hello hello" count as a repeat.
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

        Scanner in = new Scanner(System.in);

        // The first input line gives the number of sentences that follow.
        int numSentences = Integer.parseInt(in.nextLine());

        while (numSentences-- > 0) {
            String input = in.nextLine();

            Matcher m = p.matcher(input);

            // Replace every run of repeated words with its first word.
            // Going through the Matcher avoids re-interpreting matched text
            // as a new regex, which String.replaceAll(m.group(0), ...) would do.
            input = m.replaceAll("$1");

            // Prints the modified sentence.
            System.out.println(input);
        }

        in.close();
    }
}

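Answer 5:

No. The set of strings with a repeated word is not a regular language, so a regular expression in the strict formal sense cannot match it. There may be engine-/language-specific regular expressions that you can use (typically via backreferences), but there is no universal regular expression that can do that.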

Answer 6:

The example in JavaScript: The Good Parts can be adapted to do this:

var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;

\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
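The limitation shows up with accented letters. The sketch below is my own illustration (the sample string and variable names are not from the book or the answer) comparing the character-class pattern with the \w-based one.

```javascript
// Pattern adapted from the answer above: letters are listed explicitly,
// including a wide range of non-ASCII code points.
const doubledWords = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
// \w-based pattern from the accepted answer, limited to [0-9A-Z_a-z].
const asciiOnly = /\b(\w+)\s+\1\b/;

// Hypothetical sample: "un café café", with é written as an escape.
const sample = "un caf\u00e9 caf\u00e9";

console.log(sample.match(doubledWords)); // [ "café café" ]
console.log(asciiOnly.test(sample));     // false
```

The \w-based pattern stops at "caf" because é (U+00E9) is not a \w character, so it never finds the doubled word.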

Answer 7:

This is the regex I use to remove duplicate phrases in my Twitch bot:

(\S+\s*)\1{2,}

(\S+\s*) looks for any string of characters that isn't whitespace, followed by optional whitespace.

\1{2,} then requires at least two more copies of that phrase directly after it. If there are 3 identical phrases in a row, it matches.
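A short sketch of how the {2,} quantifier behaves, using made-up chat messages. One detail worth noting is that the captured group includes its trailing whitespace, so every counted copy has to end with the same whitespace.

```javascript
// Pattern from this answer: one phrase, then at least two more copies of it.
const tripled = /(\S+\s*)\1{2,}/;

console.log(tripled.test("spam spam"));           // false: only two copies
console.log(tripled.test("spam spam spam spam")); // true: "spam " three times

// The group captured "spam " with its trailing space, so a third copy at the
// very end of the string (with no space after it) does not count.
console.log(tripled.test("spam spam spam"));      // false
console.log(tripled.test("spam spam spam "));     // true
```

If three copies at the end of a message should also match, the whitespace would have to move outside the captured group; that is a change to the pattern, not something the original answer proposes.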

Answer 8:

This expression (inspired by Mike, above) seems to catch all duplicates, triplicates, etc., including the ones at the end of the string, which most of the others don't:

s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")

I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)

First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated one or more times.

I tried it like this and it worked well:

var s = "here here here     here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result     result";
console.log( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
// --> here is ahi-ahi joe's the result
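The "child's steak" point can be checked directly. This is a minimal sketch with variable names of my own; the unanchored pattern is simply the first one with the leading (^|\s+) group removed.

```javascript
// Without the leading (^|\s+), a match may start in the middle of a word.
const unanchored = /(\S+)(($|\s+)\1)+/g;
// Pattern from this answer: a match must start at the string start or after whitespace.
const anchored = /(^|\s+)(\S+)(($|\s+)\2)+/g;

// The trailing "s" of "child's" and the leading "s" of "steak" look like a repeat.
console.log("child's steak".replace(unanchored, "$1"));       // "child'steak"
console.log("child's steak".replace(anchored, "$1$2"));       // "child's steak"
console.log("child's steak steak".replace(anchored, "$1$2")); // "child's steak"
```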

Answer 9:

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.

Pattern: /(\b\S+)(?:\s+\1\b)+/
Replace: $1 (replaces the fullstring match with capture group #1)

This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).

Specifically:

\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
The + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.

*Note: if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
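Put together as a replace call, the pattern and replacement above could be used like this. The function name collapseRepeats and the sample inputs are hypothetical, and the g flag is added so that every run in the string is handled.

```javascript
// Hypothetical helper: collapses each run of repeated "words" to one copy.
function collapseRepeats(text) {
  return text.replace(/(\b\S+)(?:\s+\1\b)+/g, "$1");
}

console.log(collapseRepeats("this is is is a test test")); // "this is a test"

// Tabs, multiple spaces and newlines all count as separators.
console.log(JSON.stringify(collapseRepeats("one\ttwo  two\ntwo three"))); // "one\ttwo three"
```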

Answer 10:

Here is one that catches multiple words multiple times:

(\b\w+\b)(\s+\1)+
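A brief sketch of "multiple words multiple times", assuming the pattern is run with the global flag in JavaScript; the sample sentence is made up.

```javascript
// Pattern from this answer, with the g flag so every run is reported.
const repeatedRuns = /(\b\w+\b)(\s+\1)+/g;

// String.prototype.match with a global regex returns all full matches,
// or null when there are none.
const runs = "the the the cat sat sat".match(repeatedRuns);

console.log(runs); // [ "the the the", "sat sat" ]
```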

Answer 11:

Use this in case you want case-insensitive checking for duplicate words (the backslashes are doubled, as they would be inside a Java string literal):

(?i)\\b(\\w+)\\s+\\1\\b
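For comparison, a sketch of the same idea in JavaScript, where case-insensitivity is usually requested with the i flag on the regex literal instead of an inline (?i). The sample string is made up.

```javascript
// Same pattern with and without the case-insensitive flag.
const caseSensitive = /\b(\w+)\s+\1\b/;
const caseInsensitive = /\b(\w+)\s+\1\b/i;

const sample = "The the quick fox";

console.log(caseSensitive.test(sample));      // false: "The" and "the" differ in case
console.log(caseInsensitive.exec(sample)[0]); // "The the"
```

With the flag set, the backreference \1 is compared without regard to case as well.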

Answer 12:

Regex to strip 2+ duplicate words (consecutive or non-consecutive)

Try this regex, which catches words that occur 2 or more times and leaves behind a single copy. The duplicate words need not even be consecutive.

/\b(\w+)\b(?=.*?\b\1\b)/ig

Here, \b is a word boundary, ?= is a positive lookahead, and \1 is a back-reference. The pattern matches every occurrence of a word that appears again later on the same line, so replacing the matches with an empty string leaves only the last occurrence.
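A minimal sketch of using this pattern in a replace call. The function name keepLastOccurrence is hypothetical, and the whitespace cleanup afterwards is my own addition, since removing a word leaves its surrounding spaces behind.

```javascript
// Hypothetical helper: drops every word that occurs again later in the text,
// then tidies up the leftover spaces.
function keepLastOccurrence(text) {
  return text
    .replace(/\b(\w+)\b(?=.*?\b\1\b)/gi, "") // remove all but the last copy
    .replace(/\s+/g, " ")                    // collapse leftover whitespace
    .trim();
}

console.log(keepLastOccurrence("alpha beta alpha gamma beta")); // "alpha gamma beta"
console.log(keepLastOccurrence("Red green red"));               // "green red"
```

Because the lookahead searches forward, the surviving copy is the last one, which changes word order compared with keeping the first occurrence.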

Tags: regex duplicates capture-group