I am trying to replace substrings of string elements within a vector with blank spaces. Below are the vectors we are considering:
test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR", "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")
lista <- c("EL", "LA", "ES", "DE", "Y", "DEL", "LOS", "S.L.", "S.A.", "S.C.", "LAS",
"DEL", "THE", "OF", "AND", "BY", "S", "L", "A", "C", "SA", "SC", "SL")
Then, if we apply the mgsub function as it is, we get the following output:
library(qdap)
mgsub(lista, "", test)
# [1] "PM MOR" "RIH POOR" "M IN ERT" "NTER" ""
So I changed my list to the following and re-ran the call:
lista <- paste("\\b", lista, "\\b", sep = "")
mgsub(lista, "", test)
# [1] "PALMA DE MALLORCA" "THE RICH AND THE POOR" "A CAMEL IN THE DESERT"
# [4] "SANTANDER SL" "LA"
I cannot get the word boundary regex to work for this function.
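According to the multigsub {qdap} documentation, mgsub treats the search terms as fixed strings by default. To make sure your vector of search terms is parsed as regular expressions, you need to "manually" set the fixed parameter to FALSE.

Another important note: a word boundary (\b) placed after . requires a word character right after it, so it does not match before a space or at the end of the string. It is safer to use the (?!\w) subpattern in this case. To use look-arounds in R regex, you need to use Perl-like regex (perl = TRUE).

Thus, I suggest replacing the trailing \b with (?!\w) if a non-word character can appear only at the end of a search term, or replacing both word boundaries with look-arounds if there can be a non-word character at the beginning, too, and then calling mgsub with fixed set to FALSE and Perl-like regex enabled.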
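The suggestion can be sketched roughly as follows. This is a minimal illustration, not necessarily the exact code from the original answer. The names escaped, patterns and result are made up for the example. Escaping the dots is an extra assumption on my part, so that "S.L." is matched literally rather than as "S, any character, L, any character". The loop uses base R gsub so that it runs without qdap; the equivalent mgsub call is shown in a comment and assumes that mgsub forwards perl = TRUE to gsub.

```r
test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR",
          "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")
lista <- c("EL", "LA", "ES", "DE", "Y", "DEL", "LOS", "S.L.", "S.A.", "S.C.", "LAS",
           "DEL", "THE", "OF", "AND", "BY", "S", "L", "A", "C", "SA", "SC", "SL")

# "\\b" after a literal dot needs a word character next, so it fails at the
# end of the string, while the negative look-ahead succeeds.
with_boundary  <- grepl("S[.]L[.]\\b", "SANTANDER S.L.", perl = TRUE)
with_lookahead <- grepl("S[.]L[.](?!\\w)", "SANTANDER S.L.", perl = TRUE)

# Assumption: turn each "." into "[.]" so it only matches a literal dot.
escaped <- gsub(".", "[.]", lista, fixed = TRUE)

# Look-arounds on both sides instead of "\\b".
patterns <- paste0("(?<!\\w)", escaped, "(?!\\w)")

# With qdap this would be (assuming perl is passed through to gsub):
#   mgsub(patterns, "", test, fixed = FALSE, perl = TRUE)
# Base R equivalent: apply each pattern in turn.
result <- test
for (p in patterns) {
  result <- gsub(p, "", result, perl = TRUE)
}

# Collapse the spaces left behind by the removed words.
result <- trimws(gsub("\\s+", " ", result))
print(result)
```

Only whole words from lista are removed here, so "PALMA" and "MALLORCA" stay intact even though they contain "LA", "A" and "L".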