What is a word boundary in regexes?

2018-12-31 06:22发布

I am using Java regexes in Java 1.6 (inter alia to parse numeric output) and cannot find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of matching space-separated numbers.

Example:

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

11条回答
姐姐魅力值爆表
2楼-- · 2018-12-31 06:43

I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word characters in a string, as well as any place where before it is a word character or non-word character, and after it is the opposite. Also note that word boundary is a zero-width match.

One possible alternative is

(?:(?:^|\s)-?)\d+\b

This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.

查看更多
素衣白纱
3楼-- · 2018-12-31 06:43

Word boundary \b is used where one word should be a word character and another one a non-word character. Regular Expression for negative number should be

--?\b\d+\b

check working DEMO

查看更多
裙下三千臣
4楼-- · 2018-12-31 06:44

A word boundary can occur in one of three positions:

  1. Before the first character in the string, if the first character is a word character.
  2. After the last character in the string, if the last character is a word character.
  3. Between two characters in the string, where one is a word character and the other is not a word character.

Word characters are alpha-numeric; a minus sign is not. Taken from Regex Tutorial.

查看更多
骚的不知所云
5楼-- · 2018-12-31 06:46

I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.

查看更多
大哥的爱人
6楼-- · 2018-12-31 06:49

In the course of learning regular expression, I was really stuck in the metacharacter which is \b. I indeed didn't comprehend its meaning while I was asking myself "what it is, what it is" repetitively. After some attempts by using the website, I watch out the pink vertical dashes at the every beginning of words and at the end of words. I got it its meaning well at that time. It's now exactly word(\w)-boundary.

My view is merely to immensely understanding-oriented. Logic behind of it should be examined from another answers.

enter image description here

查看更多
登录 后发表回答