I am using Java regexes in Java 1.6 (inter alia to parse numeric output) and cannot find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of matching space-separated numbers.
Example:
Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());
This returns:
true
false
true
Check out the documentation on boundary conditions:
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
Check out this sample:
When you print it out, notice that the output is this:
[I found the value -, in my string.]
This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.
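To make that concrete, here is a minimal sketch (the class, constant, and method names are made up for illustration). It shows that \b never lets the minus sign into the match when the sign follows a space, and it tries one possible alternative for space-separated numbers: lookarounds that require whitespace or the start/end of the input on each side. The lookaround pattern is an assumption about what "space-separated" should mean, not the only way to do it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SignedNumberDemo {
    // \b cannot match between a space and '-', because neither is a word character.
    static final Pattern WITH_BOUNDARY = Pattern.compile("\\b-?\\d+\\b");

    // Assumed alternative: no non-whitespace character directly before or after.
    static final Pattern SPACE_DELIMITED = Pattern.compile("(?<!\\S)-?\\d+(?!\\S)");

    // Collects every substring the pattern finds in the input.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = " 12 -12 ";
        System.out.println(findAll(WITH_BOUNDARY, input));   // [12, 12]
        System.out.println(findAll(SPACE_DELIMITED, input)); // [12, -12]
    }
}
```

With find(), the \b version silently drops the sign; with matches(), as in the question, it fails outright because the "-" cannot be consumed after the boundary.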
I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.
Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).
The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.
Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.
Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.
Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:
Then in your test or main function:
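A minimal sketch of that idea, not the original poster's code: the delimiter set, the class name, and the helpers wholeName and count are all hypothetical. Instead of \b, it uses lookarounds that only allow whitespace, a few punctuation marks, or the start/end of the text next to the name. Pattern.quote takes care of escaping the period and pluses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LanguageNameSearch {
    // Hypothetical delimiter set: whitespace plus a few common punctuation marks.
    private static final String DELIMS = "\\s.,!?()'\"";

    // Builds a pattern that matches the name only when it is not glued to a
    // non-delimiter character on either side (start/end of input also count).
    static Pattern wholeName(String name) {
        return Pattern.compile(
            "(?<![^" + DELIMS + "])" + Pattern.quote(name) + "(?![^" + DELIMS + "])");
    }

    // Counts how many times the name occurs as a delimited token in the text.
    static int count(String name, String text) {
        Matcher m = wholeName(name).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "I know C, C++ and C#, but not .NET.";
        for (String name : new String[] {"C", "C++", "C#", ".NET"}) {
            System.out.println(name + ": " + count(name, text)); // each prints 1
        }
    }
}
```

Because "+" and "#" are deliberately left out of the delimiter set, a search for C does not also hit C++ or C#. Because "." is in the set, a name that directly follows a sentence-ending period with no space is still found.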
P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
I talk about what \b-style regex boundaries actually are here.
The short story is that they’re conditional. Their behavior depends on what they’re next to.
Sometimes that isn’t what you want. See my other answer for elaboration.
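A small sketch of that conditional behavior (class and method names are illustrative only): the same pattern, \b followed by a minus sign and digits, succeeds or fails depending purely on which character sits in front of the minus.

```java
import java.util.regex.Pattern;

class ConditionalBoundaryDemo {
    // A boundary immediately followed by '-' and one or more digits.
    static final Pattern BOUNDARY_THEN_MINUS = Pattern.compile("\\b-\\d+");

    // True if the pattern occurs anywhere in the input.
    static boolean found(String input) {
        return BOUNDARY_THEN_MINUS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("x-12")); // true:  'x' is a word char, '-' is not
        System.out.println(found(" -12")); // false: ' ' and '-' are both non-word
        System.out.println(found("-12"));  // false: start of input followed by non-word
    }
}
```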
When you use \\b(\\w+)+\\b, that means an exact match with a word containing only word characters ([a-zA-Z0-9_]).
In your case, for example, setting \\b at the beginning of the regex will accept -12 (with space) but again it won't accept -12 (without space).
For reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
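That definition can be written out with lookarounds and compared against \b directly. The sketch below (names are illustrative) lists the positions where each pattern matches. It assumes ASCII-only input, since Java's \b and \w may disagree on non-ASCII characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    static final Pattern BUILT_IN = Pattern.compile("\\b");

    // Preceded by a word char and not followed by one,
    // or followed by a word char and not preceded by one.
    static final Pattern SPELLED_OUT =
        Pattern.compile("(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)");

    // Returns the index of every (zero-length) match in the input.
    static List<Integer> positions(Pattern pattern, String input) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = " -12 ";
        System.out.println(positions(BUILT_IN, input));    // [2, 4]
        System.out.println(positions(SPELLED_OUT, input)); // [2, 4]
    }
}
```

Both patterns report a boundary before the 1 and after the 2, and none next to the minus sign.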