How can I find repeated words in a file using grep

I need to find repeated words in a file using egrep (or grep -E) on Unix (bash).

I tried:

egrep "(\<[a-zA-Z]+\>) \1" file.txt

and

egrep "(\b[a-zA-Z]+\b) \1" file.txt

But for some reason these report matches that are not repeats. For example, the string "word words" is treated as meeting the criteria despite the word-boundary condition \> or \b.

Tags: regex bash unix grep word-boundary

4 Answers

疯言疯语

#2 · 2019-06-22 14:41

I use

pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' *

to check my documents for such errors. This also works if there is a line break between the duplicated words.

Explanation:

-M, --multiline: run in multiline mode (important if there is a line break between the duplicated words).
[a-zA-Z]+: Match words
\b: Word boundary, see tutorial
(\b[a-zA-Z]+) group it
\s+ match at least one (but as many more as necessary) whitespace characters. This includes newline.
\1: Match whatever was in the first group
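To see the multiline behaviour described above, here is a minimal sketch. The sample text and the temporary file are made up for illustration, and the loop over tool names is an assumption on my part: pcregrep is not installed everywhere, and systems that ship PCRE2 name the tool pcre2grep instead.

```shell
# Hypothetical sample: the duplicated "the" is split across a line break.
tmp=$(mktemp)
printf '%s\n' 'this is the' 'the end' 'nothing here' > "$tmp"

# Use whichever of the two tools is installed; dupes stays empty if neither is.
dupes=""
for tool in pcregrep pcre2grep; do
    if command -v "$tool" >/dev/null 2>&1; then
        # -M lets \s+ match across the newline between the two lines.
        dupes=$("$tool" -M '(\b[a-zA-Z]+)\s+\1\b' "$tmp")
        break
    fi
done

printf '%s\n' "$dupes"
rm -f "$tmp"
```

With -M, both lines that take part in the match should be printed, while "nothing here" is left out.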


倾城　Initia

#3 · 2019-06-22 14:49

This is the expected behaviour. See what man grep says:

The Backslash Character and Special Expressions

The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word. The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].

and then in another place we see what "word" is:

Matching Control

Word-constituent characters are letters, digits, and the underscore.

So this is what it produces:

$ cat a
hello bye
hello and and bye
words words
this are words words
"words words"
$ egrep "(\b[a-zA-Z]+\b) \1" a
hello and and bye
words words
this are words words
"words words"
$ egrep "(\<[a-zA-Z]+\>) \1" a
hello and and bye
words words
this are words words
"words words"
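The session above does not include the "word words" case from the question. A small sketch of that case follows, assuming GNU grep (\b and back-references inside -E patterns are GNU extensions); the sample lines are invented. The -o option prints only the matched text, which shows where each match stops.

```shell
# Assumes GNU grep; the two sample lines are made up for this demonstration.
matches=$(printf '%s\n' 'word words' 'hello and and bye' |
    grep -Eo '(\b[a-zA-Z]+\b) \1')

# Each output line is the exact text that matched, not the whole input line.
printf '%s\n' "$matches"
```

For the first line the matched text is "word word": the back-reference consumes the first four letters of "words", and nothing in the pattern requires a word boundary after it.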


Root（大扎）

#4 · 2019-06-22 15:01

egrep "(\<[a-zA-Z]+\>) \<\1\>" file.txt

fixes the problem.

Basically, you have to tell \1 that it needs to stay within word boundaries too.
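A quick before/after comparison, as a sketch under the assumption of GNU grep. The input lines are invented, and grep -E is used in place of egrep since the two are equivalent.

```shell
# Assumes GNU grep; the input lines below are invented for illustration.
input=$(printf '%s\n' 'word words' 'swords words' 'hello and and bye')

# Original pattern: nothing constrains what follows \1.
before=$(printf '%s\n' "$input" | grep -E '(\<[a-zA-Z]+\>) \1')

# Pattern with \1 wrapped in its own word boundaries.
after=$(printf '%s\n' "$input" | grep -E '(\<[a-zA-Z]+\>) \<\1\>')

printf 'before:\n%s\n\nafter:\n%s\n' "$before" "$after"
```

The "before" pattern reports both "word words" and "hello and and bye"; the "after" pattern reports only "hello and and bye".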


Lonely孤独者°

#5 · 2019-06-22 15:03

\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern as was matched by the first capture. So the fact that the first capture matched on a word boundary is no longer relevant, even though the \b is inside the capture parentheses.

If you want the second instance to also be on a word boundary, you need to say so:

egrep "(\b[a-zA-Z]+) \1\b" file.txt

That is no different from:

egrep "\b([a-zA-Z]+) \1\b" file.txt

The space in the pattern forces a word boundary, so I removed the redundant \bs. If you wanted to be more explicit, you could put them in:

egrep "\<([a-zA-Z]+)\> \<\1\>" file.txt
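To check that the three spellings above behave the same, here is a small sketch that runs each one against the same input. It assumes GNU grep and uses grep -E rather than egrep (recent GNU grep releases warn that egrep is obsolescent); the sample lines are made up.

```shell
# Assumes GNU grep; the sample lines are made up for illustration.
sample=$(printf '%s\n' 'word words' 'swords words' 'it is is fine')

results=""
for pattern in \
    '(\b[a-zA-Z]+) \1\b' \
    '\b([a-zA-Z]+) \1\b' \
    '\<([a-zA-Z]+)\> \<\1\>'
do
    # Collect the lines each pattern reports so they can be compared.
    hit=$(printf '%s\n' "$sample" | grep -E "$pattern")
    printf '%s => %s\n' "$pattern" "$hit"
    results="$results$hit;"
done
```

Each pattern reports only "it is is fine"; neither "word words" nor "swords words" counts as a repeat once the boundary after \1 is required.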
