Using mgsub function with word boundaries for repl

I am trying to replace substrings of string elements within a vector with blank spaces. Below are the vectors we are considering:

test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR", "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")

lista <- c("EL", "LA", "ES", "DE", "Y", "DEL", "LOS", "S.L.", "S.A.", "S.C.", "LAS",
       "DEL", "THE", "OF", "AND", "BY", "S", "L", "A", "C", "SA", "SC", "SL")

Then if we apply the mgsub function as it is, we get the following output:

library(qdap)
mgsub(lista, "", test)
# [1] "PM MOR"   "RIH POOR" "M IN ERT" "NTER"     ""

So I change my list to the following and reexecute:

lista <- paste("\\b", lista, "\\b", sep = "")
mgsub(lista, "", test)
# [1] "PALMA DE MALLORCA"     "THE RICH AND THE POOR" "A CAMEL IN THE DESERT"
# [4] "SANTANDER SL"          "LA"

I cannot get the word boundary regex to work for this function.

Tags: regex r qdap character-replacement

1 Answer

Fickle 薄情

Post #2 · 2019-07-10 05:17

According to the multigsub {qdap} documentation:

mgsub(pattern, replacement = NULL, text.var, leadspace = FALSE, trailspace = FALSE, fixed = TRUE, trim = TRUE, ...)
...
fixed
logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.

To make sure your vector of search terms is parsed as regular expressions, you need to "manually" set the fixed parameter to FALSE.
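To see why the fixed setting matters, here is a minimal sketch using base R's gsub rather than mgsub (so it runs without qdap installed). The variable names are only illustrative. With fixed = TRUE the backslashes are searched for literally and nothing matches; as a regex, \b acts as a word boundary.

```r
x <- "PALMA DE MALLORCA"

# fixed = TRUE: the pattern is the literal text \bDE\b, which does not
# occur in x, so the string comes back unchanged.
literal_result <- gsub("\\bDE\\b", "", x, fixed = TRUE)

# Regex mode: \b is a word boundary, so the standalone word DE is removed
# (the two surrounding spaces are left behind).
regex_result <- gsub("\\bDE\\b", "", x, perl = TRUE)

print(literal_result)
print(regex_result)
```

This mirrors what happens in the question: mgsub defaults to fixed = TRUE, so the "\\b" wrappers were treated as plain text.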

Another important note: a word boundary (\b) placed after . only matches when a word character immediately follows it, so a term such as S.L. followed by a space or by the end of the string is never matched. It is safer to use the (?!\w) negative lookahead in this case. To use look-arounds in R, you need Perl-compatible regex (perl=TRUE). Thus, I suggest using this (if a non-word character can appear only at the end of the search term):

lista <- paste("\\b", lista, "(?!\\w)", sep = "")

or (if there can be a non-word character at the beginning, too):

lista <- paste("(?<!\\w)", lista, "(?!\\w)", sep = "")

and then

mgsub(lista, "", test, fixed=FALSE, perl=TRUE)
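For readers without qdap, the same idea can be sketched in base R. This is an approximation under stated assumptions, not qdap's implementation: strip_terms is a hypothetical helper, the whitespace clean-up only imitates what mgsub's trim = TRUE appears to do, and escaping the dots is my addition (in regex mode an unescaped . matches any character). The first two lines also check the word-boundary claim above.

```r
test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR",
          "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")
lista <- c("EL", "LA", "ES", "DE", "Y", "DEL", "LOS", "S.L.", "S.A.", "S.C.", "LAS",
           "DEL", "THE", "OF", "AND", "BY", "S", "L", "A", "C", "SA", "SC", "SL")

# \b after a literal dot fails at the end of the string; (?!\w) succeeds.
dot_with_b         <- grepl("\\bS\\.L\\.\\b", "SANTANDER S.L.", perl = TRUE)
dot_with_lookahead <- grepl("\\bS\\.L\\.(?!\\w)", "SANTANDER S.L.", perl = TRUE)

# Assumption: dots in the search terms are meant literally, so escape them.
escaped  <- gsub(".", "\\.", lista, fixed = TRUE)
patterns <- paste0("(?<!\\w)", escaped, "(?!\\w)")

# Hypothetical helper: remove each term in turn, then collapse runs of
# whitespace and trim the ends.
strip_terms <- function(x, patterns) {
  for (p in patterns) {
    x <- gsub(p, "", x, perl = TRUE)
  }
  trimws(gsub("\\s+", " ", x))
}

result <- strip_terms(test, patterns)
print(result)
```

Only whole words are removed here, so "PALMA" and "MALLORCA" stay intact while "DE", "THE", "AND", "A", "SL" and "LA" are dropped.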
