Is there a way to make the R code below run quicker (i.e. vectorized to avoid use of for loops)?
My example contains two data frames. The first has dimension n1*p, and one of its p columns contains names. The second is a single column (n2*1) that also contains names. I want to keep every row of the first data frame whose name contains, as a substring, any of the names in the second data frame. Sorry for the clumsy explanation.
Example (Data frame 1):
x y
Doggy 1
Hello 2
Hi Dog 3
Zebra 4
Example (Data frame 2):
z
Hello
Dog
So in the above example I want to keep rows 1, 2 and 3 but NOT row 4, since "Dog" appears in "Doggy" and "Hi Dog", and "Hello" appears in "Hello". Row 4 is excluded because neither "Hello" nor "Dog" appears in "Zebra".
Below is my R code to do this. It runs fine, but in my real task data frame 1 has 1 million rows and data frame 2 has 50 items to match on, so it runs pretty slowly. Any suggestions on how to speed this up are appreciated.
x <- c("Doggy", "Hello", "Hi Dog", "Zebra")
y <- 1:4
# data.frame() keeps y numeric; as.data.frame(cbind(x, y)) would coerce it to character
dat <- data.frame(x = x, y = y, stringsAsFactors = FALSE)
z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)
dat$flag <- NA
for (j in seq_along(z$z)) {
  for (i in seq_len(nrow(dat))) {
    # Only test rows that no earlier name has matched yet
    if (is.na(dat$flag[i]) || dat$flag[i] == 0) {
      dat$flag[i] <- length(grep(as.character(z$z[j]), dat$x[i], perl = TRUE, value = TRUE))
    }
    # Rows already flagged 1 are left unchanged
  }
}
dat1 <- subset(dat, flag == 1)
dat1
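For reference, here is a minimal sketch of what a vectorized version of the same filter might look like. It is not a benchmarked solution. It assumes the names in z can be used either as regular expressions joined by "|" (so they contain no regex metacharacters) or as fixed strings; the variable names pattern, keep and keep_fixed are made up for illustration.

```r
# Same example data as in the question
dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4,
                  stringsAsFactors = FALSE)
z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)

# Option 1: collapse all names into one alternation pattern, "Hello|Dog".
# Assumes the names contain no regex metacharacters.
pattern <- paste(z$z, collapse = "|")

# grepl() is vectorized over dat$x, so a single call tests every row
keep <- grepl(pattern, dat$x, perl = TRUE)

# Option 2: treat each name as a literal string (fixed = TRUE) and
# combine the per-name logical vectors with an element-wise OR
keep_fixed <- Reduce("|", lapply(z$z, grepl, x = dat$x, fixed = TRUE))

dat1 <- dat[keep, ]
dat1
```

Both options loop over at most the 50 names rather than the 1 million rows, which is where the nested for loops spend their time. Option 2 is the safer choice if the names may contain characters such as "." or "(".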