I am trying to extract part of a column in a data frame using regular expressions. The problems I am running into are that grep returns the whole value, not just the matched part, and that str_extract doesn't seem to work in a vectorized way.
Here is what I'm trying. I would like df$match to show alpha.alpha. where the pattern exists and NA otherwise. How can I show only the matched part?
Also, how can I replace [a-zA-Z] in R regex? Can I use a character class or a POSIX code like [:alpha:]?
v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)
df$match <- grepl("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match
#TRUE FALSE TRUE FALSE
v2grep <- grep("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2, value = TRUE)
df$match[df$match == TRUE] <- v2grep
df$match[df$match == FALSE] <- NA
df
#v1 v2 match
#1 _a.b._ _a.b._
#2 <NA> <NA>
#3 _C.D._ _C.D._
#4 _ef_ <NA>
What I want:
#v1 v2 match
#1 _a.b._ a.b.
#2 <NA> <NA>
#3 _C.D._ C.D.
#4 _ef_ <NA>
4 Approaches...
Here are two approaches in base R, as well as one using rm_default(extract=TRUE) from the qdapRegex package (which I maintain) and one using the stringi package.

One possible solution using both grepl and sub:

Results

Data
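A minimal sketch of how grepl and sub could be combined, using the data from the question. The names pattern and has_match are illustrative assumptions, and the lazy prefix with perl = TRUE is one possible way to keep only the first match, not necessarily what the answer used:

```r
# Data from the question
v1 <- 1:4
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)

# letter, dot, letter, dot
pattern <- "[a-zA-Z]\\.[a-zA-Z]\\."

# grepl() returns FALSE for NA input, so NA rows fall through to NA below
has_match <- grepl(pattern, df$v2)

# sub() replaces the whole string with the captured group;
# the lazy ".*?" (perl = TRUE) lets the group capture the first match
extracted <- sub(paste0(".*?(", pattern, ").*"), "\\1", df$v2, perl = TRUE)

# Keep the extracted part only where the pattern was found
df$match <- ifelse(has_match, extracted, NA_character_)
df
```

Guarding with grepl matters because sub returns its input unchanged when nothing matches, so "_ef_" would otherwise come back as "_ef_" instead of NA.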
Base R solution using regmatches and regexpr, which returns -1 if no regex match is found:
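A rough sketch of that idea, using the v2 vector from the question. The names start, hit, and matched are illustrative assumptions, and the pattern uses the POSIX class [[:alpha:]] in place of [a-zA-Z] to show that form:

```r
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")

# POSIX classes go inside a bracket expression: [[:alpha:]], not [:alpha:]
pattern <- "[[:alpha:]]\\.[[:alpha:]]\\."

# regexpr() gives the start of the first match, -1 for no match, NA for NA input
m <- regexpr(pattern, v2)
start <- as.integer(m)  # drop attributes for the comparison below

# Positions that actually matched
hit <- !is.na(start) & start != -1L

# regmatches() returns only the matched substrings, so fill them into an NA vector
matched <- rep(NA_character_, length(v2))
matched[hit] <- regmatches(v2, m)
matched
```

Because regmatches drops the non-matching elements, assigning its result directly to a data frame column would fail on a length mismatch; indexing by hit keeps the result aligned with the original rows. Note that [[:alpha:]] is locale-dependent, so it can match more than the ASCII letters covered by [a-zA-Z].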