Can anyone explain the difference between the \w and \b regular expression metacharacters?
It is my understanding that both of these metacharacters are used for word boundaries. Apart from this, which metacharacter is more efficient for multilingual content?
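The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries: before the first character in the string, if the first character is a word character; after the last character in the string, if the last character is a word character; and between two characters in the string, where one is a word character and the other is not.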
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
\W is short for [^\w], the negated version of \w.
\b matches at a position that is followed by a word character but not preceded by a word character, or that is preceded by a word character but not followed by a word character.
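\w always matches at least the ASCII characters [A-Za-z0-9_]; Unicode-aware flavors may also match letters and digits from other scripts, which matters for multilingual content.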
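
To make the "whole words only" behaviour of \bword\b and the complementary \B concrete, here is a minimal sketch using Python's re module. The sample sentence is made up for illustration, and other regex flavors may treat some characters differently.

```python
import re

# Hypothetical sample text, chosen so "cat" appears alone and inside a word.
text = "cat concatenate cats, the cat."

# \bcat\b: "cat" with a word boundary on both sides (whole word only).
whole_words = [m.span() for m in re.finditer(r"\bcat\b", text)]

# \Bcat\B: "cat" with a word character on both sides (strictly inside a word).
inside_words = [m.span() for m in re.finditer(r"\Bcat\B", text)]

print(whole_words)   # [(0, 3), (26, 29)]
print(inside_words)  # [(7, 10)]
```

The "cat" in "cats" is found by neither pattern: it has a boundary on its left but a word character on its right.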
Is there anything specific you are trying to match?
Some useful regex websites for beginners, or just to whet your appetite.
I found this to be a very useful book:
\w matches a word character. \b is a zero-width match that matches a position that has a word character on one side and something that's not a word character on the other. (Examples of things that aren't word characters include whitespace, the beginning and end of the string, etc.)
\w matches a, b, c, d, e, and f in "abc def".
\b matches the (zero-width) position before a, after c, before d, and after f in "abc def".
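
The "abc def" example can be checked directly. The following is a small sketch in Python (the choice of language is mine, not the answer's); it lists the characters matched by \w and the string offsets at which \b matches.

```python
import re

s = "abc def"

# \w consumes one word character per match.
word_chars = re.findall(r"\w", s)

# \b consumes nothing; each match is an empty string at some position.
boundaries = [m.start() for m in re.finditer(r"\b", s)]

print(word_chars)  # ['a', 'b', 'c', 'd', 'e', 'f']
print(boundaries)  # [0, 3, 4, 7]
```

Offsets 0, 3, 4 and 7 are the positions before a, after c, before d and after f.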
See: http://www.regular-expressions.info/reference.html/
\w is not a word boundary; it matches any word character, including underscores: [a-zA-Z0-9_].
\b is a word boundary, that is, it matches the position between a word character and a non-word character: \W or [^\w].
These implementations may vary from language to language though.
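
As one illustration of how the definition of a word character can vary, Python 3 treats \w as Unicode-aware for str patterns by default and falls back to [a-zA-Z0-9_] when the re.ASCII flag is given. This sketch only shows Python's behaviour; other languages and flavors have their own defaults and switches.

```python
import re

text = "na\u00efve caf\u00e9"  # "naïve café", written with escapes to avoid encoding issues

# Default for str patterns in Python 3: \w includes non-ASCII letters.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w (and therefore \b) to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# \b moves with the definition of \w: is there a boundary between "f" and "é"?
unicode_boundary = re.findall(r"\bcaf\b", text)
ascii_boundary = re.findall(r"\bcaf\b", text, re.ASCII)

print(unicode_words)     # ['naïve', 'café']
print(ascii_words)       # ['na', 've', 'caf']
print(unicode_boundary)  # []
print(ascii_boundary)    # ['caf']
```

For multilingual content, this suggests checking which mode your flavor uses before relying on either \w or \b.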
@Mahender, you probably meant the difference between \W (instead of \w) and \b. If not, then I would agree with @BoltClock and @jwismar above. Otherwise continue reading.
\W would match any non-word character, so it's easy to try to use it to match word boundaries. The problem is that it will not match the start or end of a line. \b is more suited for matching word boundaries, as it will also match the start or end of a line. Roughly speaking (more experienced users can correct me here) \b can be thought of as (\W|^|$). [Edit: as @Ωmega mentions below, \b is a zero-length match so (\W|^|$) is not strictly correct, but hopefully helps explain the diff]
Quick example: For the string "Hello World", .+\W would match "Hello " (with the trailing space) but cannot match through World, because no non-word character follows it. .+\b can match through both Hello and World, because \b also matches at the end of the string.
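
The Hello World example can be checked with a short Python sketch. Alongside the literal .+ patterns it uses \w+ variants, which are my substitution so that each word is reported as a separate match, and it compares \b with the rough (\W|^|$) approximation to show why being zero-length matters.

```python
import re

s = "Hello World"

# The literal patterns from the example; .+ is greedy, so each gives one match.
literal_nonword = re.search(r".+\W", s).group()   # stops after the space
literal_boundary = re.search(r".+\b", s).group()  # runs to the end of the string

# \w+ variants (an assumption for illustration): one match per word.
words_nonword = re.findall(r"\w+\W", s)   # "World" has no following \W
words_boundary = re.findall(r"\w+\b", s)  # \b also matches at the end

# \b consumes nothing, while (\W|^|$) swallows the neighbouring space.
with_b = re.sub(r"\bWorld\b", "There", s)
with_approx = re.sub(r"(\W|^|$)World(\W|^|$)", "There", s)

print(repr(literal_nonword))   # 'Hello '
print(repr(literal_boundary))  # 'Hello World'
print(words_nonword)           # ['Hello ']
print(words_boundary)          # ['Hello', 'World']
print(with_b)                  # Hello There
print(with_approx)             # HelloThere
```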