Making gsub only replace entire words?

Posted 2020-02-10 23:04


(I'm using R.) For a list of words that's called "goodwords.corpus", I am looping through the documents in a corpus, and replacing each of the words on the list "goodwords.corpus" with the word + a number.

So for example if the word "good" is on the list, and "goodnight" is NOT on the list, then this document:

I am having a good time goodnight

would turn into:

I am having a good 1234 time goodnight

I'm using this code (EDIT: made this reproducible):

goodwords.corpus <- c("good")
test <- "I am having a good time goodnight"
for (i in seq_along(goodwords.corpus)) {
  test <- gsub(goodwords.corpus[[i]], paste(goodwords.corpus[[i]], "1234"), test)
}

However, the problem is I want gsub to only replace ENTIRE words. The issue that arises is that: "good" is on the "goodwords.corpus" list, but then "goodnight", which is NOT on the list, is also affected. So I get this:

I am having a good 1234 time good 1234night

Is there any way I can tell gsub to only replace ENTIRE words, and not words that might be a part of other words?

I want to use this:

test <-gsub("\\<goodwords.corpus[[i]]\\>", paste(goodwords.corpus[[i]], "1234"), test)

I've read that \< and \> tell gsub to only look for whole words. But obviously that doesn't work, because goodwords.corpus[[i]] is treated as literal text, not evaluated, when it's inside the quotes.

Any suggestions?


You are so close to getting this. You're already using paste to form the replacement string, so why not use it to form the pattern string as well?

goodwords.corpus <- c("good")
test <- "I am having a good time goodnight"
for (i in seq_along(goodwords.corpus)) {
    test <- gsub(paste0('\\<', goodwords.corpus[[i]], '\\>'), paste(goodwords.corpus[[i]], "1234"), test)
}
test
# [1] "I am having a good 1234 time goodnight"

(paste0 is merely paste(..., sep='').)
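To see why paste0 rather than plain paste matters for the pattern, here is a small sketch. The variable names word, spaced and tight are only illustrative. With its default separator, paste puts a space between the pieces, so the pattern it builds is not the one you want.

```r
word <- "good"

# paste() joins its arguments with a single space by default
spaced <- paste("\\<", word, "\\>")

# paste0() joins them with nothing in between
tight <- paste0("\\<", word, "\\>")

# cat() shows the raw regex text, without R's string escaping
cat(spaced, "\n")
cat(tight, "\n")

# paste0(...) and paste(..., sep = "") build the same string
same <- identical(tight, paste("\\<", word, "\\>", sep = ""))
print(same)
```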

(I posted this at the same time as @MatthewLundberg, and his answer is also correct. I'm actually more familiar with using \b rather than \<, but I thought I'd stick with your code.)
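As a quick check that the two notations behave the same on this example (a sketch using R's default regex engine; the names x, with_angle and with_b are just for illustration):

```r
x <- "I am having a good time goodnight"

# Word start / word end anchors
with_angle <- gsub("\\<good\\>", "good 1234", x)

# Generic word-boundary anchor
with_b <- gsub("\\bgood\\b", "good 1234", x)

print(with_angle)
print(identical(with_angle, with_b))
```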


Use \b to indicate a word boundary:

> text <- "good night goodnight"
> gsub("\\bgood\\b", paste("good", 1234), text)
[1] "good 1234 night goodnight"

In your loop, something like this:

for (word in goodwords.corpus){
  patt <- paste0('\\b', word, '\\b')
  repl <- paste(word, "1234")

  test <- gsub(patt, repl, test)
}
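
If the list is long, one possible variation is to skip the loop and put all the words into a single pattern. This is only a sketch under some assumptions: the second list entry "time" is made up for illustration, and the words are assumed to contain no regex metacharacters (such as "." or "+"), which would otherwise need escaping.

```r
# Assumption: "time" is an illustrative extra entry, and no word in the
# list contains regex metacharacters.
goodwords.corpus <- c("good", "time")
test <- "I am having a good time goodnight"

# Build one pattern of the form \b(good|time)\b
patt <- paste0("\\b(", paste(goodwords.corpus, collapse = "|"), ")\\b")

# "\\1" in the replacement refers back to whichever word matched
result <- gsub(patt, "\\1 1234", test)
print(result)
# [1] "I am having a good 1234 time 1234 goodnight"
```

The word boundaries sit outside the parentheses, so they apply to every alternative, and "goodnight" is still left alone.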