Vectorizing a loop through lines of data frame R w

Posted 2019-09-05 22:34

Question:

Yet another apply question.

I've reviewed a lot of documentation on the apply family of functions in R (and use them quite a bit in my work). I've defined a function myfun below which I want to apply to every row of the data frame incdf. I think I need some variant of apply(incdf, 1, myfun). I've played around with it for a while, but still can't quite get it. I've included a loop that does exactly what I want; it's just very slow on my real data, which is considerably larger than the sample data I've included here.

I expect it's a quick fix, but I can't quite put my finger on it. Maybe it has something to do with the special ... argument to apply?

In plain English, here is what the code below does: for each Submit.Date in the incdf data frame, I want to count the rows in chgdf that have the same ID and whose Submit.Date is within some range of that date. The range is controlled by fdays and bdays in myfun.

Setting up some fake data:

chgdf <- data.frame(Submit.Date=as.Date(c("2013-09-27", "2013-09-04", "2013-08-01", "2013-06-24", "2013-05-29", "2013-08-20")), ID=c('001', '001', '001', '001', '001', '005'), stringsAsFactors=FALSE)
incdf <- data.frame(Submit.Date=as.Date(c("2013-10-19", "2013-09-14", "2013-08-22", "2013-08-20")), ID=c('001', '001', '002', '006'), stringsAsFactors=FALSE)

The function I want to apply to every row of the data frame incdf:

myfun <- function(tdate, aid, chg=chgdf, inc=incdf, fdays=30, bdays=30) {
  fdays <- tdate+fdays
  bdays <- tdate-bdays
  chg2 <- chg[chg$ID==aid & chg$Submit.Date<fdays & chg$Submit.Date>bdays, ]
  ret <- nrow(chg2)
  return(ret)
}

Works for one row of the incdf data frame:

aid <- '001'
tdate <- incdf[incdf$ID==aid, 'Submit.Date'][1]
myfun(tdate, aid=aid, bdays=50, fdays=100)

Works, but is slow on the full dataset:

incdf$chgw <- 0
for(i in seq_len(nrow(incdf))){
  aid <- incdf$ID[i]
  tdate <- incdf$Submit.Date[i]
  incdf$chgw[i] <- myfun(tdate, aid, bdays=50, fdays=100)
}

Answer 1:

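First, apply converts the data frame to a matrix before iterating, and because the data frame contains a character column, every value (including the dates) is coerced to a string. You therefore need to convert tdate back to a Date before using it; otherwise you're trying to add days to a string: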

tdate <- as.Date(tdate)
fdays <- tdate+fdays
bdays <- tdate-bdays

Second, you call apply(incdf, 1, myfun). In that case apply passes a single argument to myfun (the whole row) rather than the separate arguments that myfun expects.
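
Both points can be checked directly. The following is a minimal sketch using the sample incdf from the question; the variable names m, row_classes, n_args, err and shifted are made up for illustration.

```r
# Sample data from the question
incdf <- data.frame(
  Submit.Date = as.Date(c("2013-10-19", "2013-09-14", "2013-08-22", "2013-08-20")),
  ID = c("001", "001", "002", "006"),
  stringsAsFactors = FALSE
)

# apply() works on as.matrix(incdf); the character ID column forces the
# whole matrix, dates included, to character
m <- as.matrix(incdf)
typeof(m)

# Each row reaches the function as a character vector ...
row_classes <- unname(apply(incdf, 1, function(row) class(row)))

# ... and as exactly one argument per call
n_args <- unname(apply(incdf, 1, function(...) length(list(...))))

# Adding days to a string is an error; converting back to Date first works
err <- tryCatch("2013-10-19" + 30, error = function(e) e)
shifted <- as.Date("2013-10-19") + 30
```

Here typeof(m) should report "character", which is why as.Date() is needed inside the function before any date arithmetic.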

Solution 1: Change your function to receive a whole row of the dataframe and call as you did:

myfun <- function(row, chg=chgdf, inc=incdf, fdays=30, bdays=30) {
  tdate <- as.Date(row[1])
  fdays <- tdate+fdays
  bdays <- tdate-bdays
  chgdf2 <- chg[chg$ID==row[2] & chg$Submit.Date<fdays & chg$Submit.Date>bdays, ]
  ret <- nrow(chgdf2)
  return(ret)
}
> apply(incdf, 1, myfun)
[1] 1 2 0 0

Solution 2: Call apply using all parameters in the function call:

myfun <- function(tdate, aid, chg=chgdf, inc=incdf, fdays=30, bdays=30) {
  fdays <- tdate+fdays
  bdays <- tdate-bdays
  chgdf2 <- chg[chg$ID==aid & chg$Submit.Date<fdays & chg$Submit.Date>bdays, ]
  ret <- nrow(chgdf2)
  return(ret)
}
> apply(incdf, 1, function(row) myfun(as.Date(row[1]), row[2]))
[1] 1 2 0 0

I personally prefer the second solution, because it lets you override the default values of the other parameters of myfun:

> apply(incdf, 1, function(row) myfun(as.Date(row[1]), row[2], bdays=50, fdays=50))
[1] 2 3 0 0


Answer 2:

Similar to Julian's answer:

sapply(
  split(incdf, seq_len(nrow(incdf))),
  function(x) do.call(myfun, c(unname(x), bdays=50, fdays=100))
)

Here I don't use apply because apply coerces the whole row to a single type, which may not be desirable. Note that we need unname(x) because the columns of your data frame don't have the same names as the arguments of your function, so they have to be matched by position.
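
Another way to avoid the coercion, not taken from either answer above, is to walk the two columns in parallel with mapply, so that each date reaches the function still as a Date. This is a minimal sketch under the question's sample data; myfun is rewritten slightly (the unused inc argument is dropped and the rows are counted with sum), and the names upper, lower, chgw and loop are assumptions made for illustration.

```r
# Sample data from the question
chgdf <- data.frame(
  Submit.Date = as.Date(c("2013-09-27", "2013-09-04", "2013-08-01",
                          "2013-06-24", "2013-05-29", "2013-08-20")),
  ID = c("001", "001", "001", "001", "001", "005"),
  stringsAsFactors = FALSE
)
incdf <- data.frame(
  Submit.Date = as.Date(c("2013-10-19", "2013-09-14", "2013-08-22", "2013-08-20")),
  ID = c("001", "001", "002", "006"),
  stringsAsFactors = FALSE
)

# Same counting logic as the question's myfun (strict inequalities kept)
myfun <- function(tdate, aid, chg = chgdf, fdays = 30, bdays = 30) {
  upper <- tdate + fdays
  lower <- tdate - bdays
  sum(chg$ID == aid & chg$Submit.Date < upper & chg$Submit.Date > lower)
}

# mapply pairs up the i-th date with the i-th ID; arguments that are the
# same for every call go in MoreArgs
chgw <- mapply(myfun, incdf$Submit.Date, incdf$ID,
               MoreArgs = list(bdays = 50, fdays = 100), USE.NAMES = FALSE)

# Reference result from an explicit loop, as in the question
loop <- integer(nrow(incdf))
for (i in seq_len(nrow(incdf))) {
  loop[i] <- myfun(incdf$Submit.Date[i], incdf$ID[i], bdays = 50, fdays = 100)
}
```

Note that this still calls myfun once per row, so it mainly tidies the code and avoids the type coercion; it is not expected to be dramatically faster than the loop on large data.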