I would like to match the following expression in bash:
^.*(\b((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))\b).*$
Really all I want to know is whether one of the words of the string tested is one of the words described in this regex (720p, 1080p, brrip, ...). And there seems to be an issue with the word boundaries.
The test I use is [[ $name =~ $re ]] && echo "yes", where $name is any string and $re is my regex expression.
What am I missing?
The accepted answer may be erroneous on two minor points: \b and \<|\> (word boundary matching) are not a PCRE innovation. Then again, I am unable to trace the introduction of word boundary matching in RE engines, so it may as well have been Perl.
That said, this answer is very specific to Linux builds of Bash (with a final MacOSX-specific section, which may apply to all BSD derivatives as well).
By definition, GNU Regular Expressions (RE) support both \b and \<|\> as word boundaries (grep syntax). This is not a Perl Compatible Regular Expression extension, AFAIK. [1]
Bash has supported GNU Extended RE (grep -E syntax) since 3.0. [2]
Thus for all versions of Bash >= 3.0, [[ " h " =~ '\bh\b' ]] && echo yes || echo no should give me yes. It does not (see the next points).
In Bash versions 3.0 through 3.1, [[ " h " =~ '\bh\b' ]] && echo yes || echo no will give me yes. Notice that the pattern itself is the right hand side (RHS) argument of the =~ operator. [2]
Bash-3.2 changed the quoting rules for the match operator =~. [2]
Since Bash-3.2, the pattern should ideally be stored in a variable and the variable should be supplied as the RHS argument to the =~ operator: pat='\bh\b' ; [[ " h " =~ $pat ]] && echo yes || echo no. The reason is that the quoting rules changed, so that if the pattern is supplied inside quotes ('' or ""), the pattern is interpreted as a string instead of a regex. [2]
Finally, your pattern is correct; it's just a weird quoting issue.
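A minimal sketch of the variable-based form, assuming Bash 3.2 or later on a Linux build whose regex library honors \b. The helper name matches_release_tag and the sample file names are hypothetical; the pattern is the one from the question.

```bash
#!/usr/bin/env bash
# Pattern from the question, kept in a variable so that Bash >= 3.2
# treats it as a regex rather than as a literal string.
re='^.*(\b((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))\b).*$'

# Hypothetical helper: prints "yes" if $1 matches $re, "no" otherwise.
matches_release_tag() {
  local name=$1
  # $re is deliberately left unquoted on the right-hand side of =~.
  if [[ $name =~ $re ]]; then
    echo yes
  else
    echo no
  fi
}

matches_release_tag "Some.Show.s01e02.hdtv.mkv"
matches_release_tag "holiday notes"
```

Quoting the right-hand side, as in [[ $name =~ "$re" ]], would make Bash compare against the pattern text literally, which is the quoting behavior described above.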
Further, for Bash on MacOSX, the boundary match changes from \b to [[:<:]] (start of word) and [[:>:]] (end of word). [3]
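A rough sketch of how the BSD-style markers might be used, assuming a regex library that accepts [[:<:]] and [[:>:]] as on MacOSX. The probe and the fallback to \b are my own addition so that the snippet also runs on Linux builds; the helper name has_word and the sample strings are hypothetical, and the word argument is assumed to contain no regex metacharacters.

```bash
#!/usr/bin/env bash
# Probe: a BSD regex library accepts [[:<:]] and [[:>:]]. Where the
# library rejects them, the test is non-zero and we fall back to \b.
probe='[[:<:]]a[[:>:]]'
if [[ a =~ $probe ]]; then
  wb_start='[[:<:]]'
  wb_end='[[:>:]]'
else
  wb_start='\b'
  wb_end='\b'
fi

# Hypothetical helper: prints "yes" if word $2 occurs as a whole word in $1.
has_word() {
  local text=$1 word=$2
  local pat="${wb_start}${word}${wb_end}"
  if [[ $text =~ $pat ]]; then
    echo yes
  else
    echo no
  fi
}

has_word "movie 720p rip" "720p"
has_word "movie x720p rip" "720p"
```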
References:
[1] GNU grep Manual: Regex section
[2] The Bash FAQ, by its author
[3] MacOSX manpage for re_format
\b is a PCRE extension; it isn't available in POSIX ERE (Extended Regular Expressions), which is the smallest set of syntax that the =~ operator in bash's [[ ]] is guaranteed to honor. (An individual operating system may have a libc which extends this syntax; in that case those extensions will be available on such operating systems, but not on all platforms where bash is supported.)
As a baseline, the \b extension doesn't actually have very much expressive power: you can write any PCRE that uses it as an equivalent ERE. Better, though, is to step back and question the underlying assumptions: when you say "word boundary", what do you really mean? If all you care about is that the match starts and ends either at whitespace or at the beginning or end of the string, then you don't need the \b operator at all.
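A sketch of what such a whitespace-delimited pattern might look like, using only POSIX ERE constructs. The alternation is the word list from the question with redundant parentheses dropped (my simplification); the helper name has_tag and the sample strings are hypothetical.

```bash
#!/usr/bin/env bash
# Word list from the question, written as a plain ERE alternation.
words='720p|1080p|(br|hd|bd|web|dvd)rip|(x|h)264|DVDscr|xvid|hdtv|ac3|s[0-9]{2}e[0-9]{2}|avi|mp4|mkv|eztv|YIFY'

# Each word must be delimited by whitespace or by the start/end of the string.
re="(^|[[:space:]])(${words})(\$|[[:space:]])"

# Hypothetical helper: prints "yes" if $1 contains one of the words.
has_tag() {
  if [[ $1 =~ $re ]]; then
    echo yes
  else
    echo no
  fi
}

has_tag "my movie 720p mkv"
has_tag "my.movie.720p.mkv"
```

The second call is expected to report no match, because a dot is not whitespace; that is the trade-off of treating only whitespace as a word delimiter.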
operator at all:Note that I took out the initial
^.*
and ending.*$
, since those constructs are self-negating when doing an otherwise-unanchored match; the.*
makes the^
that immediately precedes it meaningless, and likewise the.*
just before the final$
.Now, if you want an exact equivalent to
\b
when placed immediately before a word character at the beginning of a sequence, then we get something more like:...and, likewise, when immediately after a word character at the end of a sequence:
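One way these two emulations could be written in ERE, as a sketch that assumes ASCII word characters ([a-zA-Z0-9_]). The word list is the question's, with redundant parentheses dropped; the helper name find_tag and the sample strings are hypothetical.

```bash
#!/usr/bin/env bash
# Word list from the question, written as a plain ERE alternation.
words='720p|1080p|(br|hd|bd|web|dvd)rip|(x|h)264|DVDscr|xvid|hdtv|ac3|s[0-9]{2}e[0-9]{2}|avi|mp4|mkv|eztv|YIFY'

# Leading group: start of string or one non-word character.
# Trailing group: end of string or one non-word character.
re="(^|[^a-zA-Z0-9_])(${words})(\$|[^a-zA-Z0-9_])"

# Hypothetical helper: prints the first matched word, or returns 1 if none.
find_tag() {
  if [[ $1 =~ $re ]]; then
    # Group 1 is the leading delimiter, group 2 is the matched word.
    echo "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

find_tag "my.movie.720p.mkv" || echo "no tag"
find_tag "my.movie.x720p" || echo "no tag"
```

Unlike \b, these groups consume the delimiter character, which is why the word itself is read from the second capture group rather than from the whole match.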
Both of these are somewhat degenerate cases (there are other situations where emulating the behavior of \b in ERE can be more complicated), but they're the only situations your question appears to present.
Note that some implementations of \b would have better support for non-ASCII character sets, and thus be better described with [^[:alnum:]_] rather than [^a-zA-Z0-9_], but it's not well-defined here which implementation you're coming from or comparing against.