-->

Using mgsub function with word boundaries for repl

Posted 2019-07-10 05:27

Question:

I am trying to remove certain whole words from the string elements of a vector by replacing them with empty strings. These are the vectors in question:

test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR", "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")

lista <- c("EL", "LA", "ES", "DE", "Y", "DEL", "LOS", "S.L.", "S.A.", "S.C.", "LAS",
       "DEL", "THE", "OF", "AND", "BY", "S", "L", "A", "C", "SA", "SC", "SL")

If we apply the mgsub function as is, every occurrence of these substrings is removed, even inside longer words:

library(qdap)
mgsub(lista, "", test)
# [1] "PM MOR"   "RIH POOR" "M IN ERT" "NTER"     ""  

So I wrap each term in word boundaries and rerun:

lista <- paste("\\b", lista, "\\b", sep = "")
mgsub(lista, "", test)
# [1] "PALMA DE MALLORCA"     "THE RICH AND THE POOR" "A CAMEL IN THE DESERT"
# [4] "SANTANDER SL"          "LA"   

Now nothing is replaced at all. I cannot get the word boundary regex to work with this function.

Answer 1:

According to multigsub {qdap} documentation:

mgsub(pattern, replacement = NULL, text.var, leadspace = FALSE, trailspace = FALSE, fixed = TRUE, trim = TRUE, ...)
...
fixed
logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.

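Because fixed defaults to TRUE, the search terms are matched as literal strings, so a pattern such as \bLA\b is looked up character for character and never found. To have your vector of search terms parsed as regular expressions, you need to set the fixed argument to FALSE explicitly.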
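
As a minimal illustration of what fixed changes, the sketch below uses base R's gsub instead of qdap. It assumes that mgsub's fixed argument behaves like gsub's, which is what the documentation excerpt above suggests; the variable names are only for illustration.

```r
x   <- "THE RICH AND THE POOR"
pat <- "\\bTHE\\b"

# fixed = TRUE: looks for the literal characters \bTHE\b, which never occur in x
literal <- gsub(pat, "", x, fixed = TRUE)

# fixed = FALSE: \b is interpreted as a word boundary, so the word THE is removed
as_regex <- gsub(pat, "", x, fixed = FALSE)

print(literal)   # "THE RICH AND THE POOR"
print(as_regex)  # " RICH AND  POOR"
```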

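Another important note: a word boundary (\b) placed after a non-word character such as . only matches if a word character follows it. It does not match before a space or at the end of the string, so a term like S.L. followed by \b would not be found at the end of a name. It is safer to use the negative lookahead (?!\w) in this case, which only requires that the next character is not a word character. Look-arounds in R require the PCRE engine, so you also need perl = TRUE. Thus, I suggest using this (if a non-word character can appear only at the end of the search term):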

lista <- paste("\\b", lista, "(?!\\w)", sep = "")

or (if there can be a non-word character at the beginning, too):

lista <- paste("(?<!\\w)", lista, "(?!\\w)", sep = "")

and then

mgsub(lista, "", test, fixed=FALSE, perl=TRUE)
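
The difference between a trailing \b and (?!\w) can be checked without qdap. The following is a rough sketch in base R, not the mgsub call itself: the input "SANTANDER S.L." and the shortened stop-word list are assumptions made for the example, and the dots are escaped by hand because an unescaped . in a regex matches any character.

```r
# Part 1: trailing \b versus (?!\w) after a non-word character
x <- "SANTANDER S.L."

# After the final "." comes the end of the string, so there is no word boundary
with_b <- gsub("\\bS\\.L\\.\\b", "", x, perl = TRUE)

# The lookahead only requires that no word character follows
with_lookahead <- gsub("\\bS\\.L\\.(?!\\w)", "", x, perl = TRUE)

print(with_b)          # "SANTANDER S.L."
print(with_lookahead)  # "SANTANDER "

# Part 2: applying look-around patterns one after another with base gsub
test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR",
          "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")
stopwords <- c("DE", "THE", "AND", "A", "SL", "LA")  # hypothetical subset of lista
patterns  <- paste0("(?<!\\w)", stopwords, "(?!\\w)")

# Fold over the patterns, removing each one from every element of test
out <- Reduce(function(txt, p) gsub(p, "", txt, perl = TRUE), patterns, test)

# Collapse the leftover runs of spaces and trim the ends
out <- trimws(gsub("\\s+", " ", out))
print(out)
# "PALMA MALLORCA" "RICH POOR" "CAMEL IN DESERT" "SANTANDER" ""
```

The final cleanup step is an addition for readability of the output; mgsub has its own trim argument, as shown in the signature quoted above.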