Replace a data frame column based on regex

2020-04-12 08:45发布

I am trying to extract part of a column in a data frame using regular expressions. Problems I am running into include the facts that grep returns the whole value, not just the matched part, and that str_extract doesn't seem to work in a vectorized way.

Here is what I'm trying. I would like df$match to show alpha.alpha. where the pattern exists and NA otherwise. How can I show only the matched part?

Also, how I can I replace [a-zA-Z] in R regex? Can I use a character class or a POSIX code like [:alpha:]?

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)

df$match <- grepl("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match

#TRUE FALSE  TRUE FALSE

v2grep <- grep("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2, value = TRUE)

df$match[df$match == TRUE] <- v2grep
df$match[df$match == FALSE] <- NA

df

#v1  v2      match
#1   _a.b._  _a.b._
#2   <NA>    <NA>
#3   _C.D._  _C.D._
#4   _ef_    <NA>

What I want:

#v1  v2      match
#1   _a.b._  a.b.
#2   <NA>    <NA>
#3   _C.D._  C.D.
#4   _ef_    <NA>

标签: regex r
3条回答
Viruses.
2楼-- · 2020-04-12 08:56

4 Approaches...

Here's 2 approaches in base as well as with rm_default(extract=TRUE) in the qdapRegex package I maintain and the stringi package.

unlist(sapply(regmatches(df[["v2"]], gregexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df[["v2"]])), function(x){
        ifelse(identical(character(0), x), NA, x)
    })
)

## [1] "a.b." NA     "C.D." NA 

pat <- "(.*?)([a-zA-Z]\\.[a-zA-Z]\\.)(.*?)$"
df[["v2"]][!grepl(pat, df[["v2"]])] <- NA
df[["v2"]] <- gsub(pat, "\\2", df[["v2"]])

## [1] "a.b." NA     "C.D." NA

library(qdapRegex)
unlist(rm_default(df[["v2"]], pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", extract = TRUE))

## [1] "a.b." NA     "C.D." NA 

library(stringi)
stri_extract_first_regex(df[["v2"]], "[a-zA-Z]\\.[a-zA-Z]\\.")

## [1] "a.b." NA     "C.D." NA 
查看更多
Luminary・发光体
3楼-- · 2020-04-12 09:02

One possible solution using both grepl and sub:

# First, remove unwanted characters around pattern when detected
df$match <- sub(pattern = ".*([a-zA-Z]\\.[a-zA-Z]\\.).*", 
                replacement = "\\1", x = df$v2)
# Second, check if pattern is present; otherwise set to NA
df$match <- ifelse(grepl(pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", x = df$match),
                   yes = df$match, no = NA)

Results

df

#   v1     v2 match
# 1  1 _a.b._  a.b.
# 2  2   <NA>  <NA>
# 3  3 _C.D._  C.D.
# 4  4   _ef_  <NA>

Data

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)
查看更多
够拽才男人
4楼-- · 2020-04-12 09:20

Base R solution using regmatches, and regexpr which returns -1 if no regex match is found:

r <- regexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match <- NA
df$match[which(r != -1)] <- regmatches(df$v2, r)

#  v1     v2 match
#1  1 _a.b._  a.b.
#2  2   <NA>  <NA>
#3  3 _C.D._  C.D.
#4  4   _ef_  <NA>
查看更多
登录 后发表回答