The str_replace (and preg_replace) functions in PHP replace all occurrences of the search string with the replacement string. What interests me most is that if the search and replace arguments are arrays (in R we call them vectors), then str_replace takes a value from each array (vector) and uses them to search and replace on the subject.
In other words, does R (or some R package) have a function to perform the following:
string <- "The quick brown fox jumped over the lazy dog."
patterns <- c("quick", "brown", "fox")
replacements <- c("slow", "black", "bear")
xxx_replace_xxx(string, patterns, replacements) ## ???
## [1] "The slow black bear jumped over the lazy dog."
So I am looking for something like chartr, but for search patterns and replacement strings with an arbitrary number of characters. This cannot be done with a single call to gsub(), as its replacement argument can only be a single string (see ?gsub). My current implementation is:
xxx_replace_xxx <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
  string
}
However, I am looking for something much faster when length(patterns) is large: I have a lot of data to process and I'm dissatisfied with the current performance.
Example toy data for benchmarking:
string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
"po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
"sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))
Using PCRE instead of fixed matching takes ~1/3 the time on my machine for your example.
xxx_replace_xxx_pcre <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
  string
}
system.time(x <- xxx_replace_xxx(string, patterns, replacements))
# user system elapsed
# 0.491 0.000 0.491
system.time(p <- xxx_replace_xxx_pcre(string, patterns, replacements))
# user system elapsed
# 0.162 0.000 0.162
identical(x,p)
# [1] TRUE
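One caveat with the PCRE version: with perl=TRUE the patterns are interpreted as regular expressions, so the results match the fixed=TRUE version only while the patterns contain no metacharacters (as in the example data). A minimal sketch of one way to keep literal matching is to wrap each pattern in PCRE's \Q...\E quoting. The helper name quote_pcre is hypothetical, and the sketch assumes that no pattern contains the sequence \E and that the replacements contain no backslashes:

```r
# Hypothetical helper: quote each pattern so PCRE treats it literally.
# Assumption: no pattern contains the sequence "\E".
quote_pcre <- function(patterns) paste0("\\Q", patterns, "\\E")

x <- c("a.b", "axb")

# Unquoted: "." is a wildcard, so both elements are replaced.
unquoted <- gsub("a.b", "X", x, perl = TRUE)

# Quoted: only the literal string "a.b" is replaced.
quoted <- gsub(quote_pcre("a.b"), "X", x, perl = TRUE)
```

The quoted patterns can then be passed to xxx_replace_xxx_pcre() in place of the raw ones. Whether the speed advantage survives the quoting was not measured here.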
If the patterns are fixed strings made of word characters, as in the example, then this works. gsubfn is like gsub except that the replacement argument can be a string, list, function or proto object. If it's a list, as here, it compares the matches of the regular expression with the list's names and replaces those that are found with the corresponding values:
library(gsubfn)
gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string)
## [1] "The slow black bear jumped over the lazy dog."
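The same lookup idea can be sketched in base R, for readers who want to see what happens in a single pass: extract every word, look it up in a named vector, and write the results back. This is only an illustration of the technique, not how gsubfn is implemented; the function name replace_words is hypothetical, and like the gsubfn call it assumes the patterns are whole words made of word characters.

```r
# Hypothetical base-R sketch of a one-pass, lookup-based replacement.
replace_words <- function(string, patterns, replacements) {
  lookup <- setNames(replacements, patterns)
  # Locate every run of word characters in each element of `string`.
  m <- gregexpr("\\b\\w+\\b", string, perl = TRUE)
  # Replace only those words that appear among the patterns.
  regmatches(string, m) <- lapply(regmatches(string, m), function(words) {
    hit <- words %in% patterns
    words[hit] <- unname(lookup[words[hit]])
    words
  })
  string
}

result <- replace_words("The quick brown fox jumped over the lazy dog.",
                        c("quick", "brown", "fox"),
                        c("slow", "black", "bear"))
```

Because each word is looked up exactly once, a replacement is never re-scanned by a later pattern, which differs from the sequential gsub loop.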
This can be done with stringi >= 0.3-1 by using one of the stri_replace_all_* functions with the vectorize_all argument set to FALSE:
library("stringi")
string <- "The quicker brown fox jumped over the lazy dog."
patterns <- c("quick", "brown", "fox")
replacements <- c("slow", "black", "bear")
stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)
## [1] "The slower black bear jumped over the lazy dog."
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
## [1] "The quicker black bear jumped over the lazy dog."
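To see why vectorize_all matters, here is a small sketch (the variable names are only illustrative) comparing the default behaviour with vectorize_all=FALSE. With the default, the patterns are recycled against the input, so each pattern is applied to its own copy of the string; with FALSE, all patterns are applied one after another to the same string:

```r
library(stringi)

s    <- "The quick brown fox"
pats <- c("quick", "brown", "fox")
reps <- c("slow", "black", "bear")

# Default (vectorize_all = TRUE): one result per pattern/replacement pair.
each <- stri_replace_all_fixed(s, pats, reps)

# vectorize_all = FALSE: every pair is applied in turn to the same string.
combined <- stri_replace_all_fixed(s, pats, reps, vectorize_all = FALSE)
```

Here each has three elements, one per pattern, while combined is the single string with all three substitutions applied.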
Some benchmarks:
string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
"po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
"sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))
microbenchmark::microbenchmark(
  stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
  stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE),
  xxx_replace_xxx_pcre(string, "\\b" %s+% patterns %s+% "\\b", replacements),
  gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string),
  unit="relative",
  times=25
)
## Unit: relative
## expr min lq mean median uq max neval
## stri_replace_all_fixed 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 25
## stri_replace_all_regex 2.169701 2.248115 2.198638 2.267935 2.267635 1.753289 25
## xxx_replace_xxx_pcre 1.983135 1.967303 1.937021 1.961449 1.974422 1.469894 25
## gsubfn 63.067835 69.870657 69.815031 71.178841 72.503020 57.019072 25
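So, as far as matching only at word boundaries is concerned, the PCRE-based version is the fastest. stri_replace_all_fixed is about twice as fast again, but it also replaces matches inside longer words (as in "quicker" above).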