I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).
Now I want all rows with the same ID that have overlapping time spans (or a start date that falls on the day right after the end date of another row) to be merged together.
Merging them means that they end up in one row with the same ID, the min(start date) and the max(end date) (I hope you understand what I mean).
I have written a function for that (it is not fully tested, but it looks fine for the moment). The problem is that, as my data frame has nearly 100,000 observations, the function is very slow.
Can you help me improve my function in terms of efficiency?
Here is the function:
smoothingEpisodes <- function(theData) {
    theOutput <- data.frame()

    curId    <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd   <- theData[1, "END"]

    for (i in 2:nrow(theData)) {
        nextId    <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd   <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            theOutput <- rbind(theOutput,
                               data.frame(ID = curId, START = curStart, END = curEnd))
            curId    <- nextId
            curStart <- nextStart
            curEnd   <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    theOutput <- rbind(theOutput,
                       data.frame(ID = curId, START = curStart, END = curEnd))
    theOutput
}
Thank you!
[edit]
test data:
ID START END
1 1 2000-01-01 2000-03-31
2 1 2000-04-01 2000-05-31
3 1 2000-04-15 2000-07-31
4 1 2000-09-01 2000-10-31
5 2 2000-01-15 2000-03-31
6 2 2000-02-01 2000-03-15
7 2 2000-04-01 2000-04-15
8 3 2000-06-01 2000-06-15
9 3 2000-07-01 2000-07-15
(START and END have data type "Date"; ID is numeric.)
A dput of the data:
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957,
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"),
END = structure(c(11047, 11108, 11169, 11261, 11047, 11031,
11062, 11123, 11153), class = "Date")), .Names = c("ID",
"START", "END"), class = "data.frame", row.names = c(NA, 9L))
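For reference (this worked example is my addition, not part of the original post), calling the function on this test data collapses the nine rows into five episodes. The function and data are repeated so the snippet runs on its own:

```r
## the question's function, verbatim
smoothingEpisodes <- function(theData) {
    theOutput <- data.frame()
    curId    <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd   <- theData[1, "END"]
    for (i in 2:nrow(theData)) {
        nextId    <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd   <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            theOutput <- rbind(theOutput,
                               data.frame(ID = curId, START = curStart, END = curEnd))
            curId    <- nextId
            curStart <- nextStart
            curEnd   <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    rbind(theOutput, data.frame(ID = curId, START = curStart, END = curEnd))
}

## the dput'ed test data from the question
testData <- structure(list(
    ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
    START = structure(c(10957, 11048, 11062, 11201, 10971,
                        10988, 11048, 11109, 11139), class = "Date"),
    END = structure(c(11047, 11108, 11169, 11261, 11047,
                      11031, 11062, 11123, 11153), class = "Date")),
    class = "data.frame", row.names = c(NA, 9L))

smoothingEpisodes(testData)
## ID      START        END
##  1 2000-01-01 2000-07-31
##  1 2000-09-01 2000-10-31
##  2 2000-01-15 2000-04-15
##  3 2000-06-01 2000-06-15
##  3 2000-07-01 2000-07-15
```

Note how rows 1 to 3 of ID 1 merge (row 2 starts the day after row 1 ends, and row 3 overlaps row 2), while the fourth row stays separate because of the gap in August.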
The first [without really thinking too hard about what you are trying to do] optimisation I would suggest is to allocate storage for theOutput. At the moment, you are growing theOutput at each iteration of the loop. In R that is an absolute no-no!! That is something you never do, unless you like woefully slow code: R has to copy the object and expand it during each iteration, and that is slow.
Looking at the code, we know that theOutput needs at most nrow(theData) rows and 3 columns, so create an object of that size before the loop starts and fill it in during the loop.
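The original answer's code for this is not preserved here; the following is a sketch of the approach it describes (the name smoothingEpisodes2, the row counter j, and the allocation details are my choices). It also fills the rows directly rather than building a data.frame() each iteration:

```r
smoothingEpisodes2 <- function(theData) {
    n <- nrow(theData)
    ## pre-allocate the maximal number of output rows (one per input row);
    ## seeding START/END from row 1 keeps the Date class on those columns
    theOutput <- data.frame(ID    = numeric(n),
                            START = rep(theData$START[1], n),
                            END   = rep(theData$END[1],   n))
    j <- 1L                               # next output row to fill

    curId    <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd   <- theData[1, "END"]
    for (i in 2:n) {
        nextId    <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd   <- theData[i, "END"]
        if (curId != nextId || (curEnd + 1) < nextStart) {
            ## close the current episode: fill a pre-allocated row
            ## directly, with no rbind() and no data.frame() call
            theOutput[j, ] <- list(curId, curStart, curEnd)
            j <- j + 1L
            curId    <- nextId
            curStart <- nextStart
            curEnd   <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    theOutput[j, ] <- list(curId, curStart, curEnd)
    theOutput[seq_len(j), ]               # drop unused rows
}
```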
It isn't clear what START and END are; if these are numerics, then working with a matrix and not a data frame could also improve speed.
Also, creating a data frame during each iteration is going to be slow. I can't time this without spending a lot of my own time, but you could just fill in the bits you want directly, without incurring the data.frame() call during each iteration.
The best tip I can give you, however, is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data, one big enough to give you a bit of run-time to gather useful profiling data without having to wait ages for the profiling run to complete. To profile in R, use Rprof(); you can then look at the output with summaryRprof().
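A minimal Rprof() sketch (the stand-in workload and the file name are hypothetical choices of mine, not the original answer's code):

```r
## stand-in workload: a deliberately slow loop that grows a data frame,
## so the profiler has something to catch
slowFun <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) out <- rbind(out, data.frame(x = i))
    out
}

Rprof("profile.out")                 # start writing sampling data to a file
res <- slowFun(4000)
Rprof(NULL)                          # stop profiling

## time spent per function, sorted by self time
head(summaryRprof("profile.out")$by.self)
```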
Hadley Wickham (@hadley) has a package, profr, that makes this a bit easier. And, as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.
Edit: as the OP provided some test data, I knocked up something quick to show the speed-up achieved just by following good loop practice. Using the test dataset provided in object testData, I get about a 50% speed-up. Not dramatic, but simple to achieve just by not growing an object at each iteration.
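To illustrate the general point with a standalone toy comparison (my addition, not the answerer's actual benchmark, which is not preserved in this copy): growing a data frame inside a loop versus pre-allocating it:

```r
n <- 2000

## growing: rbind() copies and expands the whole data frame every iteration
grow <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) out <- rbind(out, data.frame(x = i))
    out
}

## pre-allocated: storage created once, rows filled in place
prealloc <- function(n) {
    out <- data.frame(x = numeric(n))
    for (i in seq_len(n)) out[i, "x"] <- i
    out
}

system.time(grow(n))        # noticeably slower
system.time(prealloc(n))
```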
I did it slightly differently, to avoid having to delete empty rows at the end. Quite a big improvement on my original version!
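The code for this variant is not preserved here. One way to get an output of exactly the right size, with no empty rows to delete, is to locate the episode boundaries first and build each column in one go; this is a sketch of mine, not necessarily the original approach (all names are hypothetical):

```r
smoothingEpisodesVec <- function(theData) {
    id    <- theData[["ID"]]
    start <- as.numeric(theData[["START"]])
    end   <- as.numeric(theData[["END"]])
    n     <- length(id)

    ## running maximum of END within each ID: a merged episode's end is the
    ## max over its rows, so the previous row's END alone is not enough
    cmax <- ave(end, id, FUN = cummax)

    ## a new episode starts where the ID changes or a gap of > 1 day appears
    newEp <- c(TRUE, id[-1] != id[-n] | start[-1] > cmax[-n] + 1)
    epi   <- cumsum(newEp)

    ## exactly one output row per episode: nothing to trim afterwards
    data.frame(ID    = as.vector(tapply(id, epi, `[`, 1)),
               START = as.Date(as.vector(tapply(start, epi, min)),
                               origin = "1970-01-01"),
               END   = as.Date(as.vector(tapply(end, epi, max)),
                               origin = "1970-01-01"))
}
```

This relies on the data being ordered by ID and START, as stated in the question.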
Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns to integer and double vectors.
As a side note, I try to use [[ where applicable, and I try to keep integers as integers by writing j <- j + 1L etc. That does not make any difference here, but sometimes coercing between integers and doubles can take quite some time.
To see the speed difference, take your data and replicate it 1000 times before timing both versions.
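The code from this answer is likewise not preserved in this copy; the following is a sketch in the same spirit (the function name and the benchmark details are mine): extract the three columns to plain vectors once with [[, loop over those, and rebuild the data frame only at the end.

```r
smoothingEpisodesFast <- function(theData) {
    id    <- theData[["ID"]]
    start <- as.numeric(theData[["START"]])
    end   <- as.numeric(theData[["END"]])
    n     <- length(id)

    ## pre-allocated result vectors at maximal size (one episode per row)
    outId    <- numeric(n)
    outStart <- numeric(n)
    outEnd   <- numeric(n)
    j <- 1L

    curId <- id[1]; curStart <- start[1]; curEnd <- end[1]
    for (i in 2:n) {
        if (id[i] != curId || curEnd + 1 < start[i]) {
            outId[j] <- curId; outStart[j] <- curStart; outEnd[j] <- curEnd
            j <- j + 1L
            curId <- id[i]; curStart <- start[i]; curEnd <- end[i]
        } else if (end[i] > curEnd) {     # assumes no NA in END
            curEnd <- end[i]
        }
    }
    outId[j] <- curId; outStart[j] <- curStart; outEnd[j] <- curEnd

    data.frame(ID    = outId[seq_len(j)],
               START = as.Date(outStart[seq_len(j)], origin = "1970-01-01"),
               END   = as.Date(outEnd[seq_len(j)],   origin = "1970-01-01"))
}

## the question's test data, repeated so this snippet runs on its own
testData <- structure(list(
    ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
    START = structure(c(10957, 11048, 11062, 11201, 10971,
                        10988, 11048, 11109, 11139), class = "Date"),
    END = structure(c(11047, 11108, 11169, 11261, 11047,
                      11031, 11062, 11123, 11153), class = "Date")),
    class = "data.frame", row.names = c(NA, 9L))

## replicate 1000 times with shifted IDs so copies don't merge into
## each other, and time the vector-based version
bigData <- do.call(rbind, lapply(0:999, function(k) {
    d <- testData
    d$ID <- d$ID + k * 10L
    d
}))
system.time(res <- smoothingEpisodesFast(bigData))
```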