remove multiple patterns from text vector r

2019-09-05 10:30发布

问题:

I want to remove multiple patterns from multiple character vectors. Currently I am going:

a.vector <- gsub("@\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)

etc etc.

This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.

Neither the mapply nor the mgsub are working. I made these vectors

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")

Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.

a.vector looks like this:

[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"   

I want:

[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4952] "you are phenomenal #mental #Writing"   `

回答1:

Try combining your subpatterns using |. For example

>s<-"@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"

But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.

Consider creating your remove vector as you suggested, then applying it in a loop

> s1 <- s
> remove<-c("@\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"

This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants



回答2:

In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.

For the example you provided, you can try:

removePat <- "(@\\w+)|(http\\w+)|([[:punct:]])"

a.vector <- gsub(removePat, "", a.vector)


回答3:

I know this answer is late on the scene but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e. when "needed") using the regex seperator "|":

library(stringr)

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))

Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.