I need to find repeated words in a file using egrep (or grep -E) in Unix (bash).
I tried:
egrep "(\<[a-zA-Z]+\>) \1" file.txt
and
egrep "(\b[a-zA-Z]+\b) \1" file.txt
But for some reason these report things as repeats that aren't!
For example, the string "word words" is considered a match despite the word boundary condition \> or \b.
I use
to check my documents for such errors. This also works if there is a line break between the duplicated words.
Explanation:
-M, --multiline: Run in multiline mode (important if there is a line break between the duplicated words).
[a-zA-Z]+: Match words
\b: Word boundary, see tutorial
(\b[a-zA-Z]+): Group it
\s+: Match at least one (but as many more as necessary) whitespace characters. This includes newline.
\1: Match whatever was in the first group

This is the expected behaviour. See what man grep says:
and then in another place we see what "word" is:
So this is what will produce:
fixes the problem.
Basically, you have to tell \1 that it needs to stay within word boundaries too.
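To make this concrete, here is a small sketch, assuming GNU grep (where `grep -E` is equivalent to `egrep` and supports `\<`, `\>` and back-references as extensions). The sample text and the variable names `sample`, `loose` and `strict` are made up for illustration. The `-o` flag prints only the matched part, which shows exactly what `\1` picked up.

```shell
# Hypothetical sample: one false positive candidate and one real repeat.
sample='word words
the the cat'

# Pattern from the question: \1 is free to match a prefix of a longer word.
loose=$(printf '%s\n' "$sample" | grep -Eo '(\<[a-zA-Z]+\>) \1')

# Same pattern with an end-of-word anchor after the back-reference.
strict=$(printf '%s\n' "$sample" | grep -Eo '(\<[a-zA-Z]+\>) \1\>')

printf 'loose:\n%s\n' "$loose"     # includes "word word" taken from "word words"
printf 'strict:\n%s\n' "$strict"   # only the real repeat
```

With the unanchored back-reference, "word words" shows up as the match "word word"; once `\>` follows `\1`, only "the the" remains.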
\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern as was matched by the first capture. So the fact that the first capture matched on a word boundary is no longer relevant, even though the \b is inside the capture parentheses.

If you want the second instance to also be on a word boundary, you need to say so:
That is no different from:
The space in the pattern forces a word boundary, so I removed the redundant \b's. If you wanted to be more explicit, you could put them in:
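As a sketch (not necessarily the exact commands meant above), assuming GNU grep with `\b` and back-reference support in `-E` mode, the minimal and the fully explicit forms could look like this. The sample lines and the names `sample`, `minimal` and `explicit` are hypothetical.

```shell
# Hypothetical sample lines; only the second and fourth contain a real repeat.
sample='word words
the the cat
bathe the water
is is'

# Minimal form: \b only at the outer edges, the space implies the inner boundaries.
minimal=$(printf '%s\n' "$sample" | grep -E '\b([a-zA-Z]+) \1\b')

# Explicit form: a \b on both sides of each word.
explicit=$(printf '%s\n' "$sample" | grep -E '\b([a-zA-Z]+)\b \b\1\b')

printf '%s\n' "$minimal"
[ "$minimal" = "$explicit" ] && echo "both forms select the same lines"
```

Both forms select the same lines here, because a letter next to a space is always a word boundary. The leading `\b` still matters: without it, "bathe the water" would match via the "the" at the end of "bathe".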