I have a large text body in which I want to replace words with their respective synonyms efficiently (for example, replace all occurrences of "automobile" with the synonym "car"), but I am struggling to find a proper (efficient) way to do this.
For the later analysis I use the text2vec library and would like to use that library for this task as well (avoiding tm to reduce dependencies).
An (inefficient) way would look like this:
# setup data
text <- c("my automobile is quite nice", "I like my car")
syns <- list(
  list(term = "happy_emotion", syns = c("nice", "like")),
  list(term = "car", syns = c("automobile"))
)
My brute-force solution is to use a loop to look for the words and replace them:
library(stringr)
# works but is probably not the best...
text_res <- text
for (syn in syns) {
  regex <- paste(syn$syns, collapse = "|")
  text_res <- str_replace_all(text_res, pattern = regex, replacement = syn$term)
}
# which gives me what I want
text_res
# [1] "my car is quite happy_emotion" "I happy_emotion my car"
I used to do it with tm using this approach by MrFlick (using tm::content_transformer and tm::tm_map), but I want to reduce the dependencies of the project by replacing tm with the faster text2vec.
I guess the optimal solution would be to somehow use text2vec's itoken, but I am unsure how. Any ideas?
Quite late, but I still want to add my 2 cents. I see two solutions.
A small improvement over your str_replace_all: since it is vectorized internally, you can make all replacements without a loop. I think it will be faster, but I didn't run any benchmarks.
regex_batch = sapply(syns, function(syn) paste(syn$syns, collapse = "|"))
names(regex_batch) = sapply(syns, function(x) x$term)
str_replace_all(text, regex_batch)
Naturally this task is one for hash-table lookup. The fastest implementation as far as I know is in the fastmatch package. So you can write a custom tokenizer, something like:
library(text2vec)   # provides word_tokenizer
library(magrittr)   # provides %>%
library(fastmatch)

# named vector: values are the synonyms to look up, names are the replacement terms
syn_1 = c("nice", "like")
names(syn_1) = rep('happy_emotion', length(syn_1))
syn_2 = c("automobile")
names(syn_2) = rep('car', length(syn_2))
syn_replace_table = c(syn_1, syn_2)

custom_tokenizer = function(text) {
  word_tokenizer(text) %>% lapply(function(x) {
    # hash-table lookup of each token in the synonym table
    i = fmatch(x, syn_replace_table)
    ind = !is.na(i)
    i = na.omit(i)
    x[ind] = names(syn_replace_table)[i]
    x
  })
}
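Something along these lines should then plug into itoken (an untested sketch; check the argument names against your text2vec version):
it <- itoken(text, tokenizer = custom_tokenizer)
vocab <- create_vocabulary(it)
vocab
# the vocabulary should now contain "car" and "happy_emotion" instead of
# "automobile", "nice" and "like"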
I would bet that the second solution will be faster, but that needs benchmarking.
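A rough timing sketch (assuming the microbenchmark package; note that the two calls return different structures, a character vector vs. a list of token vectors, so the comparison is only indicative):
library(microbenchmark)
big_text <- rep(text, 1e4)
microbenchmark(
  stringr_regex       = str_replace_all(big_text, regex_batch),
  fastmatch_tokenizer = custom_tokenizer(big_text),
  times = 10
)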
With base R this should work:
mgsub <- function(pattern, replacement, x) {
  if (length(pattern) != length(replacement)) {
    stop("pattern and replacement must have the same length")
  }
  for (v in seq_along(pattern)) {
    x <- gsub(pattern[v], replacement[v], x)
  }
  return(x)
}
mgsub(c("nice","like","automobile"),c(rep("happy_emotion",2),"car"),text)
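On the example data this should again give:
# [1] "my car is quite happy_emotion" "I happy_emotion my car"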
The first part of the solution by Dmitriy Selivanov requires a small change.
library(stringr)
text <- c("my automobile is quite nice", "I like my car")
syns <- list(
  list(term = "happy_emotion", syns = c("nice", "like")),
  list(term = "car", syns = c("automobile"))
)
regex_batch <- sapply(syns, function(syn) syn$term)
names(regex_batch) <- sapply(syns, function(x) paste(x$syns, collapse = "|"))
text_res <- str_replace_all(text, regex_batch)
text_res
[1] "my car is quite happy_emotion" "I happy_emotion my car"