Word boundary regex issue

Posted 2019-07-04 23:12

Question:

I'm having trouble with the word boundary \b in a regular expression. I'm using R, but the same thing happens when I try it on http://regexr.com. The pattern I'm using is \bs\.l\.\b. I expected lines 1 and 3 below to match, but only line 2 does:

aaa s.l. bbb
aaa s.l.bbb
aaa s.l., bbb

See http://regexr.com/3f154 as well.

Answer 1:

The word boundaries match in the following positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.
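
To see these rules at work on the question's input, here is a minimal R sketch. The variable names are illustrative only, and perl = TRUE is used so that the pattern is handled by the PCRE engine:

```r
# The three sample lines from the question.
x <- c("aaa s.l. bbb", "aaa s.l.bbb", "aaa s.l., bbb")

# Original pattern: the trailing \b needs a word character on exactly one side.
# The final "." is a non-word character, so \b after it only matches when a
# word character follows (line 2, where "b" comes next).
original_hits <- grepl("\\bs\\.l\\.\\b", x, perl = TRUE)
print(original_hits)  # FALSE TRUE FALSE
```

In lines 1 and 3 the final . is followed by a space or a comma, so both neighbours are non-word characters and no boundary exists there.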

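Now, you want to match s.l. only when it is preceded by a word boundary and not followed by a word character. Because the final . is itself a non-word character, the trailing \b matches only when a word character follows, which is the opposite of what you want. Replace the trailing \b with the negative lookahead (?!\w):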

\bs\.l\.(?!\w)


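If you are using base R functions (grepl, sub, gsub, and so on), pass perl=TRUE, because the default regex engine (TRE) does not support lookaheads. In stringr functions, which are powered by the ICU regex library, the pattern works as is.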
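
A short sketch of the fix in R, applied to the three lines from the question. The variable names are illustrative, and the stringr call is guarded because the package may not be installed:

```r
# The three sample lines from the question.
x <- c("aaa s.l. bbb", "aaa s.l.bbb", "aaa s.l., bbb")

# Fixed pattern: the negative lookahead (?!\w) only requires that the next
# character is not a word character. It also succeeds at the end of the string.
pattern <- "\\bs\\.l\\.(?!\\w)"

# Base R: perl = TRUE switches to PCRE, which supports lookaheads.
fixed_hits <- grepl(pattern, x, perl = TRUE)
print(fixed_hits)  # TRUE FALSE TRUE

# stringr (ICU engine): same pattern, no extra argument needed.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_detect(x, pattern))
}
```

Lines 1 and 3 now match and line 2 does not, which is the behaviour the question asked for.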



Answer 2:

The . is not a word character, so there is no word boundary between the final . and the space or comma that follows it. \b can only match where a word character meets a non-word character.
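
As a quick check of which of the characters involved count as word characters, one could test each against \w. This is only an illustrative sketch, and the variable names are made up for the example:

```r
# Characters that appear around the end of "s.l." in the sample lines.
chars <- c("s", ".", " ", ",")

# \w matches letters, digits and underscore; the other three are non-word.
is_word <- grepl("\\w", chars, perl = TRUE)
print(is_word)  # TRUE FALSE FALSE FALSE
```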