Replace words in text2vec efficiently

Published: 2019-04-12 20:00

Question:

I have a large body of text in which I want to replace words with their respective synonyms efficiently (for example, replace all occurrences of "automobile" with the synonym "car"), but I am struggling to find a proper (efficient) way to do this.

For the subsequent analysis I use the text2vec library and would like to use it for this task as well (avoiding tm to reduce dependencies).

An (inefficient) way would look like this:

# setup data
text <- c("my automobile is quite nice", "I like my car")

syns <- list(
  list(term = "happy_emotion", syns = c("nice", "like")),
  list(term = "car", syns = c("automobile"))
)

My brute-force solution is something like this: use a loop to look for the words and replace them.

library(stringr)
# works but is probably not the best...
text_res <- text
for (syn in syns) {
  regex <- paste(syn$syns, collapse = "|")
  text_res <-  str_replace_all(text_res, pattern = regex, replacement = syn$term)
}
# which gives me what I want
text_res
# [1] "my car is quite happy_emotion" "I happy_emotion my car" 

I used to do this with tm, following this approach by MrFlick (using tm::content_transformer and tm::tm_map), but I want to reduce the project's dependencies by replacing tm with the faster text2vec.
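For context, that tm-based pattern looks roughly like this (a sketch from memory, not MrFlick's exact code):

library(tm)
library(stringr)

corp <- VCorpus(VectorSource(text))
replace_syns <- content_transformer(function(x) {
  for (syn in syns) {
    x <- str_replace_all(x, paste(syn$syns, collapse = "|"), syn$term)
  }
  x
})
corp <- tm_map(corp, replace_syns)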

I guess the optimal solution would be to somehow use text2vec's itoken, but I am unsure how. Any ideas?

Answer 1:

Quite late, but I still want to add my 2 cents. I see two solutions:

  1. A small improvement over your str_replace_all. Since it is vectorized internally, you can make all the replacements without a loop. I think it will be faster, but I haven't run any benchmarks.

    regex_batch = sapply(syns, function(syn) paste(syn$syns, collapse = "|"))  
    names(regex_batch) = sapply(syns, function(x) x$term)  
    str_replace_all(text, regex_batch)  
    
  2. Naturally, this task is suited for a hash-table lookup. The fastest implementation I know of is in the fastmatch package. So you can write a custom tokenizer, something like the following:

    library(fastmatch)   # fmatch(): fast hash-based matching
    library(text2vec)    # word_tokenizer()
    library(magrittr)    # %>% pipe

    # named vector: values are the synonyms to look up,
    # names are the terms they should be replaced with
    syn_1 = c("nice", "like")
    names(syn_1) = rep('happy_emotion', length(syn_1))
    syn_2 = c("automobile")
    names(syn_2) = rep('car', length(syn_2))

    syn_replace_table = c(syn_1, syn_2)

    custom_tokenizer = function(text) {
      word_tokenizer(text) %>% lapply(function(x) {
        i = fmatch(x, syn_replace_table)      # position of each token in the table (NA if absent)
        ind = !is.na(i)                       # tokens that have a replacement
        i = na.omit(i)
        x[ind] = names(syn_replace_table)[i]  # swap matched tokens for their terms
        x
      })
    }
    

I would bet that the second solution will be faster, but it needs some benchmarking.
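As a rough sketch (not part of the original answer) of how this tokenizer would plug into the text2vec pipeline the question asks about, you could pass it to itoken() and build the vocabulary from there:

    library(text2vec)

    it <- itoken(text, tokenizer = custom_tokenizer, progressbar = FALSE)
    v  <- create_vocabulary(it)
    v   # "car" and "happy_emotion" now appear instead of the raw synonyms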



Answer 2:

With base R this should work:

mgsub <- function(pattern, replacement, x) {
  if (length(pattern) != length(replacement)) {
    stop("Pattern not equal to Replacement")
  }
  for (v in seq_along(pattern)) {
    x <- gsub(pattern[v], replacement[v], x)
  }
  return(x)
}

mgsub(c("nice","like","automobile"),c(rep("happy_emotion",2),"car"),text)
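Assuming the `text` vector defined in the question, this call should reproduce the result of the stringr approach:

# [1] "my car is quite happy_emotion" "I happy_emotion my car"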


Answer 3:

The first part of the solution by Dmitriy Selivanov requires a small change: in str_replace_all() the names of the vector are the patterns and its values are the replacements, so the assignment of names and values has to be swapped.

library(stringr)    

text <- c("my automobile is quite nice", "I like my car")

syns <- list(
             list(term = "happy_emotion", syns = c("nice", "like")),
             list(term = "car", syns = c("automobile"))
             )

regex_batch <- sapply(syns, function(syn) syn$term)  
names(regex_batch) <- sapply(syns, function(x) paste(x$syns, collapse = "|"))  
text_res <- str_replace_all(text, regex_batch) 

text_res
# [1] "my car is quite happy_emotion" "I happy_emotion my car"


Tags: r text2vec