Is it possible to use grepl with a vector of patterns, maybe in the way the %in% operator works with a vector of values? I want to take the data below and, if the animal name has "dog" or "cat" in it, return a certain value, say, "keep"; if it doesn't have "dog" or "cat", I want to return "discard".
data <- data.frame(animal = sample(c("cat", "dog", "bird", "doggy", "kittycat"), 50, replace = TRUE))
Now, if I were just to do this by strictly matching values, say, "cat" and "dog", I could use the following approach:
matches <- c("cat","dog")
data$keep <- ifelse(data$animal %in% matches, "Keep", "Discard")
But grep or grepl only uses the first element of the pattern vector:
data$keep <- ifelse(grepl(matches, data$animal), "Keep","Discard")
returns
Warning message:
In grepl(matches, data$animal) :
argument 'pattern' has length > 1 and only the first element will be used
Note, I saw this thread in my search, but this doesn't appear to work:
grep using a character vector with multiple patterns
You can use an "or" (|) operator inside the regular expression passed to grepl.
ifelse(grepl("dog|cat", data$animal), "keep", "discard")
# [1] "keep" "keep" "discard" "keep" "keep" "keep" "keep" "discard"
# [9] "keep" "keep" "keep" "keep" "keep" "keep" "discard" "keep"
#[17] "discard" "keep" "keep" "discard" "keep" "keep" "discard" "keep"
#[25] "keep" "keep" "keep" "keep" "keep" "keep" "keep" "keep"
#[33] "keep" "discard" "keep" "discard" "keep" "discard" "keep" "keep"
#[41] "keep" "keep" "keep" "keep" "keep" "keep" "keep" "keep"
#[49] "keep" "discard"
The regular expression dog|cat tells the regular expression engine to look for either "dog" or "cat", so grepl returns TRUE for every element that contains either one.
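To make the difference from exact matching concrete, here is a minimal sketch on a small fixed vector (the `animals` vector is made up for illustration, so the result does not depend on the random sample in the question):

```r
# Small fixed vector so the result is reproducible (illustrative only).
animals <- c("cat", "dog", "bird", "doggy", "kittycat")

# TRUE wherever the element contains "dog" or "cat" anywhere in the string.
hits <- grepl("dog|cat", animals)
hits
# [1]  TRUE  TRUE FALSE  TRUE  TRUE

# Exact matching with %in% only accepts whole-string matches.
exact <- animals %in% c("cat", "dog")
exact
# [1]  TRUE  TRUE FALSE FALSE FALSE
```

Note that "doggy" and "kittycat" are hits for grepl but not for %in%, which is the behaviour the question asks for.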
Not sure what you tried but this seems to work:
data$keep <- ifelse(grepl(paste(matches, collapse = "|"), data$animal), "Keep","Discard")
Similar to the answer you linked to.
The trick is using the paste:
paste(matches, collapse = "|")
#[1] "cat|dog"
So it creates a regular expression with either dog OR cat and would also work with a long list of patterns without typing each.
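One caveat with pasting patterns together: the result is interpreted as a regular expression, so characters such as "." or "(" keep their special meaning. If the patterns are meant as literal substrings, a possible alternative is to test each one with fixed = TRUE and combine the results. The sketch below assumes a made-up pattern vector and a hypothetical helper named `matches_any`; neither comes from the question.

```r
animals <- c("cat", "dog", "bird", "doggy", "kittycat", "c.t")
matches <- c("c.t", "dog")  # hypothetical patterns; "." is a regex metacharacter

# Joined into one regex, "." matches any character, so "cat" is also a hit.
as_regex <- grepl(paste(matches, collapse = "|"), animals)

# Hypothetical helper: match each pattern literally, then OR the results.
matches_any <- function(patterns, x) {
  Reduce(`|`, lapply(patterns, function(p) grepl(p, x, fixed = TRUE)))
}
as_literal <- matches_any(matches, animals)

as_regex
# [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
as_literal
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
```

For plain words like "cat" and "dog" the two approaches give the same answer, so the paste version is the simpler choice there.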
Edit:
In case you are doing this to later on subset the data.frame according to "Keep" and "Discard" entries, you could do this more directly using:
data[grepl(paste(matches, collapse = "|"), data$animal),]
This way, the results of grepl, which are TRUE or FALSE, are used directly for the subset.
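As a small sketch of that subsetting step, using a fixed data frame made up for illustration: the logical vector from grepl selects the rows. When the data frame has only one column, as here, `data[rows, ]` would simplify to a plain vector, so the sketch adds `drop = FALSE` to keep a data frame.

```r
# Fixed example data (illustrative, not the random sample from the question).
data <- data.frame(animal = c("cat", "dog", "bird", "doggy", "kittycat"),
                   stringsAsFactors = FALSE)
matches <- c("cat", "dog")

# Logical vector: TRUE for rows whose animal contains any of the patterns.
keep <- grepl(paste(matches, collapse = "|"), data$animal)

# drop = FALSE keeps the result a data frame even with a single column.
kept <- data[keep, , drop = FALSE]
nrow(kept)  # 4 rows remain: cat, dog, doggy, kittycat
```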
Try to avoid ifelse as much as possible. This, for example, works nicely:
c("Discard", "Keep")[grepl("(dog|cat)", data$animal) + 1]
With a seed of 123 you will get
## [1] "Keep" "Keep" "Discard" "Keep" "Keep" "Keep" "Discard" "Keep"
## [9] "Discard" "Discard" "Keep" "Discard" "Keep" "Discard" "Keep" "Keep"
## [17] "Keep" "Keep" "Keep" "Keep" "Keep" "Keep" "Keep" "Keep"
## [25] "Keep" "Keep" "Discard" "Discard" "Keep" "Keep" "Keep" "Keep"
## [33] "Keep" "Keep" "Keep" "Discard" "Keep" "Keep" "Keep" "Keep"
## [41] "Keep" "Discard" "Discard" "Keep" "Keep" "Keep" "Keep" "Discard"
## [49] "Keep" "Keep"
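The indexing trick works because adding 1 to a logical vector coerces FALSE to 1 and TRUE to 2, which then picks the first or second label. A step-by-step sketch on a small made-up vector:

```r
animals <- c("cat", "bird", "doggy")  # illustrative input

hit <- grepl("dog|cat", animals)      # TRUE FALSE TRUE
idx <- hit + 1                        # 2 1 2 (FALSE -> 1, TRUE -> 2)

# Index into the label vector: position 1 is "Discard", position 2 is "Keep".
labels <- c("Discard", "Keep")[idx]
labels
# [1] "Keep"    "Discard" "Keep"

# Same result as the ifelse version.
identical(labels, ifelse(hit, "Keep", "Discard"))
# [1] TRUE
```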