Regex difference between word boundary end and edge

Posted 2019-02-27 01:41

Question:

The R help file for regex says

The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word

What is the difference between an end and an edge (of a word)?

Answer 1:

One practical difference is engine support: \b also works in PCRE regex patterns (when you specify perl=TRUE) and in ICU regex patterns (the stringr package), whereas \< and \> do not, as the examples below show.

> s <- "no where nowhere"
> sub("\\<no\\>", "", s)
[1] " where nowhere"
> sub("\\<no\\>", "", s, perl=TRUE) ## \> and \< do not work with PCRE
[1] "no where nowhere"
> sub("\\bno\\b", "", s, perl=TRUE) ## \b works with PCRE
[1] " where nowhere"

> library(stringr)
> str_replace(s, "\\bno\\b", "")
[1] " where nowhere"
> str_replace(s, "\\<no\\>", "")
[1] "no where nowhere"

The advantage of \< and \> is that they are unambiguous: \< always matches the beginning of a word and \> always matches the end of a word. \b matches an edge, that is, either of those two positions.
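
To make the end-versus-edge distinction concrete, here is a minimal sketch using the default (TRE) engine. The test string and variable names are only illustrative. grepl() is used so that each pattern is matched once against the whole string.

```r
x <- "nowhere"

# At the start of a word, \< and \b agree.
start_lt <- grepl("\\<no", x)     # "no" begins the word
start_b  <- grepl("\\bno", x)     # \b also matches at the start edge

# At the end of a word, \> and \b agree.
end_gt <- grepl("where\\>", x)    # "where" ends the word
end_b  <- grepl("where\\b", x)    # \b also matches at the end edge

# \< and \> are one-sided, so swapping them cannot match.
swap_gt <- grepl("\\>no", x)      # no word ends right before "no"
swap_lt <- grepl("where\\<", x)   # no word starts right after "where"

# "where" is preceded by a word character here, so it has no start edge.
mid_lt <- grepl("\\<where", x)

print(c(start_lt, start_b, end_gt, end_b))  # all TRUE
print(c(swap_gt, swap_lt, mid_lt))          # all FALSE
```

In other words, \b placed before a word behaves like \<, and placed after a word it behaves like \>, while \< and \> each accept only one of the two edges.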

One more thing to consider (reference):

POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
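
A minimal sketch of the workaround the help text suggests, assuming plain ASCII input (the sample string is the one used earlier in this answer, and the variable name marked is only illustrative): with perl = TRUE, a bare \b can be replaced globally to mark every word edge.

```r
s <- "no where nowhere"

# Insert a marker at every word edge. perl = TRUE is used because the
# help text warns that POSIX mode mishandles repeated word boundaries.
marked <- gsub("\\b", "|", s, perl = TRUE)
print(marked)
# [1] "|no| |where| |nowhere|"
```

Each word ends up with a marker on both sides, because \b matches at the beginning and at the end of every word.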