R Subset Dataset Using Regular Expression

Posted 2019-02-25 04:46

Question:

Is there a way to make the R code below run quicker (i.e. vectorized, avoiding for loops)?

My example contains two data frames. The first has dimension n1*p, and one of its p columns contains names. The second is a single column (n2*1) that also contains names. I want to keep every row of the first data frame whose name contains any of the names in the second data frame as a substring. Sorry for the clumsy explanation.

Example (Data frame 1):

x        y 
Doggy    1 
Hello    2 
Hi Dog   3 
Zebra    4 

Example (Data frame 2):

z
Hello
Dog

So in the above example I want to keep rows 1, 2, and 3 but not 4: "Dog" appears in "Doggy" and "Hi Dog", and "Hello" appears in "Hello". Row 4 is excluded because neither "Hello" nor "Dog" appears in "Zebra".

Below is my R code to do this. It runs fine, but in my real task data frame 1 has 1 million rows and data frame 2 has 50 items to match on, so it runs pretty slowly. Any suggestions on how to speed this up are appreciated.

x <- c("Doggy", "Hello", "Hi Dog", "Zebra")
y <- 1:4
# Build the data frames directly so that y stays numeric and x stays character
dat <- data.frame(x = x, y = y, stringsAsFactors = FALSE)

z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)

# flag is NA until a row is first tested, then 0 (no match so far) or 1 (matched)
dat$flag <- NA
for (j in seq_along(z$z)) {
  for (i in seq_len(nrow(dat))) {
    # Only test rows that have not already matched an earlier name
    if (is.na(dat$flag[i]) || dat$flag[i] == 0) {
      dat$flag[i] <- length(grep(z$z[j], dat$x[i], perl = TRUE))
    }
  }
}

dat1 <- subset(dat, flag == 1)
dat1

Answer 1:

Try this:

dat[grep(paste(z$z, collapse = "|"), dat$x), ]

or

subset(dat, grepl(paste(z$z, collapse = "|"), x))
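
Both lines build a single alternation pattern, "Hello|Dog", and test every row in one vectorized call instead of looping. A minimal sketch of the intermediate steps on the question's example data follows; the names pattern, keep, hits, and keep_literal are introduced here for illustration only. Because the names in z are interpreted as regular expressions, a name containing metacharacters such as "." or "(" could match differently than expected. The last lines show one possible literal-matching variant with fixed = TRUE, assuming plain substring matching is what is wanted.

```r
dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4,
                  stringsAsFactors = FALSE)
z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)

# Collapse the names into one alternation pattern: "Hello|Dog"
pattern <- paste(z$z, collapse = "|")

# grepl is vectorized over dat$x, so no explicit loop over rows is needed
keep <- grepl(pattern, dat$x)
dat[keep, ]

# Literal (non-regex) variant: one vectorized grepl per name, OR-ed together
hits <- lapply(z$z, function(p) grepl(p, dat$x, fixed = TRUE))
keep_literal <- Reduce(`|`, hits)
dat[keep_literal, ]
```

With 50 names the literal variant makes 50 vectorized passes over the column rather than one per row and name, so it should still be far cheaper than the nested loop, though this has not been timed here.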


Answer 2:

This question inspired a boolean text search function (%bs%) in the qdap package, so I thought I'd share that approach here:

library(qdap)
dat[dat$x %bs% paste(z$z, collapse = "OR"), ]

In this case it saves no typing, but it may be a useful approach when several OR/AND conditions are involved.
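
For readers without qdap, a comparable AND/OR search can be sketched in base R by combining grepl results with & and |. This is not how qdap implements %bs%, only an illustration of the idea; the helper has() and the query ("Hi" AND "Dog", OR "Hello") are made up for this example.

```r
dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4,
                  stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for each row of dat whose x contains the literal term
has <- function(term) grepl(term, dat$x, fixed = TRUE)

# (contains "Hi" AND "Dog") OR contains "Hello"
keep <- (has("Hi") & has("Dog")) | has("Hello")
dat[keep, ]
```

Each has() call is vectorized over the whole column, so the logical operators combine full-length vectors without any explicit loop.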



Tags: regex r subset