In a large data set (~1M cases), each case has a "created" and a "censored" dateTime
. I want to count the number of other cases that were open at the time each case was created. Cases are open between their "created" and "censored" dataTimes
.
Several solutions work well on small datasets (<100,000 cases), but computation time grows exponentially. My estimate is that computation time increases as a function 3n^2. By n=100,000 cases, computation time is >20 mins on my server with 6 * 4GHz cores & 64GB RAM. Even with multi-core libraries, at best, I could reduce the time by a factor of 8 or 10. Not enough to handle ~1M cases.
I'm looking for a more efficient method to do this calculation. Below I have provided a function that allows you to easily create large numbers of "created" and "censored" dateTime
pairs along with two solutions tried so far, using dplyr
and data.table
libraries. The timings are reported to the user for simplicity. You can simply change the "CASE_COUNT" variable at the top to re-execute and view times again and easily compare the timing of other solutions you might have to suggest.
I will update the original post with other solutions to give proper credit to their authors. Thanks in advance for your help with this!
# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);
# Set seed for reproducibility
set.seed(123)
# Set number of cases & date range variables
CASE_COUNT <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="UTC", origin="1970-01-01");
RANGE_END <- as.POSIXct("2012-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="UTC", origin="1970-01-01");
# Select which solutions you want to run in this test
RUN_SOLUTION_1 <- TRUE; # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE; # data.table:foverlaps()
RUN_SOLUTION_3 <- TRUE; # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE; # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE; # data.table::frank()
# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)
generate_cases_table <- function(n = CASE_COUNT, start_val=RANGE_START, end_val=RANGE_END) {
# Measure duration between start_val & end_val
duration <- as.numeric(difftime(end_val, start_val, unit="secs"));
# Select random values in duration to create start_offset
start_offset <- runif(n, 0, duration);
# Calculate the creation time list
created_list <- start_offset + start_val;
# Calculate acceptable time range for censored values
# since they must always be after their respective creation value
censored_range <- as.numeric(difftime(RANGE_END, created_list, unit="secs"));
# Select random values in duration to create end_offset
creation_to_censored_times <- runif(n, 0, censored_range);
censored_list <- created_list + creation_to_censored_times;
# Create and return a data.table with creation & censor values
# calculated from start or end with random offsets
return_table <- data.table(id = 1:n,
created = created_list,
censored = censored_list);
return(return_table);
}
# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();
solution_1_function <- function (cases_table) {
# SOLUTION 1: Using dplyr::summarize:
# Group by id to set parameters for summarize() function
cases_table_grouped <- group_by(cases_table, id);
# Count the instances where other cases were created before
# and censored after each case using vectorized sum() within summarize()
cases_table_summary <- summarize(cases_table_grouped,
open_cases_at_creation = sum((cases_table$created < created &
cases_table$censored > created)));
solution_1_table <<- as.data.table(cases_table_summary, key="id");
} # End solution_1_function
solution_2_function <- function (cases_table) {
# SOLUTION 2: Using data.table::foverlaps:
# Adapted from solution provided by @Davidarenburg
# (https://stackoverflow.com/users/3001626/david-arenburg)
# The foverlaps() solution tends to crash R with large case counts
# I suspect it has to do with memory assignment of the very large objects
# It maxes RAM on my system (64GB) before crashing, possibly attempting
# to write beyond its assigned memory limits.
# I'll submit a reproduceable bug to the data.table team since
# foverlaps() is pretty new and known to be occasionally unstable
if (CASE_COUNT > 50000) {
stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
}
setDT(cases_table)[, created_dupe := created];
setkey(cases_table, created, censored);
foverlaps_table <- foverlaps(cases_table[,c("id","created","created_dupe"), with=FALSE],
cases_table[,c("id","created","censored"), with=FALSE],
by.x=c("created","created_dupe"))[order(i.id),.N-1,by=i.id];
foverlaps_table <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);
solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function
solution_3_function <- function (cases_table) {
# SOLUTION 3: Using data.table aggregation instead of dplyr::summarize
# Idea suggested by @jangorecki
# (https://stackoverflow.com/users/2490497/jangorecki)
# Count the instances where other cases were created before
# and censored after each case using vectorized sum() with data.table aggregation
cases_table_aggregated <- cases_table[order(id), sum((cases_table$created < created &
cases_table$censored > created)),by=id];
solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated, open_cases_at_creation=V1), key="id");
} # End solution_3_function
solution_4_function <- function (cases_table) {
# SOLUTION 4: Using IRanges package
# Adapted from solution suggested by @alexis_laz
# (https://stackoverflow.com/users/2414948/alexis-laz)
# The IRanges package generates ranges efficiently, intended for genome sequencing
# but working perfectly well on this data, since POSIXct values are numeric-representable
solution_4_table <<- data.table(id = cases_table$id,
open_cases_at_creation = countOverlaps(IRanges(cases_table$created,
cases_table$created),
IRanges(cases_table$created,
cases_table$censored))-1, key="id");
} # End solution_4_function
solution_5_function <- function (cases_table) {
# SOLUTION 5: Using data.table::frank()
# Adapted from solution suggested by @danas.zuokas
# (https://stackoverflow.com/users/1249481/danas-zuokas)
n <- CASE_COUNT;
# For every case compute the number of other cases
# with `created` less than `created` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]), ties.method = 'first')[1:n];
# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]), ties.method = 'first')[1:n];
solution_5_table <<- data.table(id = cases_table$id,
open_cases_at_creation = r1 - r2, key="id");
} # End solution_5_function;
# Execute user specified functions;
if (RUN_SOLUTION_1)
solution_1_timing <- system.time(solution_1_function(cases_table));
if (RUN_SOLUTION_2) {
solution_2_timing <- try(system.time(solution_2_function(cases_table)));
cases_table <- select(cases_table, -created_dupe);
}
if (RUN_SOLUTION_3)
solution_3_timing <- system.time(solution_3_function(cases_table));
if (RUN_SOLUTION_4)
solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
solution_5_timing <- system.time(solution_5_function(cases_table));
# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing)!="try-error") {
same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing)!="try-error") {
same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing)!="try-error") {
same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing)!="try-error") {
same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}
same_check <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
same_check2_3, same_check2_4, same_check2_5, same_check3_4,
same_check3_5, same_check4_5);
# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
"It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && class(solution_2_timing)!="try-error")
cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
cat("The IRanges solution solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
cat("The data.table:frank() solution solution took", solution_5_timing[3], "seconds.\n\n");
The data.table::foverlaps()
solution is faster for fewer cases (<5000 or so; depends on randomness in addition to n since it uses binary search to optimize). The dplyr::summarize()
solution is faster for more cases (>5,000 or so). Much beyond 100,000, neither solution is viable as they're both too slow.
EDIT: Added a third solution based on the idea suggested by @jangorecki that uses data.table
aggregation instead of dplyr::summarize()
, and is otherwise similar to the dplyr
solution. For up to around 50,000 cases, it is the fastest solution. Beyond 50,000 cases, the dplyr::summarize()
solution is slightly faster, but not by much. Sadly, for 1M cases it's still not practical.
EDIT2: Added a fourth solution adapted from the solution suggested by @alexis_laz that uses the IRanges
package and its countOverlaps
function.
It is significantly faster than the 3 other solutions. With 50,000 cases it was almost 400% faster than solutions 1 & 3.
EDIT3: Modified case generating function to properly exercise the "censored" condition. Thanks to @jangorecki for catching the limitation of the previous version.
EDIT4: Rewrote to allow user to select which solutions to execute and to use system.time()
for timing comparison with garbage collection before each execution for more accurate timing (as per @jangorecki's astute observation) - Also added some condition checks for crash cases.
EDIT5: Added a fifth solution adapted from the solution suggested by @danas.zuokas using rank()
. My experimentation suggests that it's always at least an order of magnitude slower than the other solutions. At 10,000 cases, it takes 44 seconds vs. 3.5 seconds for dplyr::summarize
and 0.36 seconds for IRanges
solutions.
FINAL EDIT: I've made slight modifications to solution 5 that was suggested by @danas.zuokas, and matching the observation by @Khashaa about types. I've set the type as.numeric
in the dataTime
generation function which drastically speeds up rank
as it operates on integers
or doubles
instead of dateTime
objects(increases speed of other functions too, but not as drastically). With some testing, setting ties.method='first'
yields results consistent with the intent. data.table::frank
is faster than both base::rank
and IRanges::rank
. bit64::rank
is fastest, but it seems to handle ties differently than data.table::frank
and I can't get it to handle them as desired. Once bit64
is loaded, it masks a large number of types and functions, changing the results of data.table::frank
along the way. The specific reasons why are beyond the scope of this question.
POST END NOTE: Turns out that data.table::frank
handles POSIXct
dateTimes
efficiently, whereas neither base::rank
nor IRanges::rank
seem to. As such, even the as.numeric
(or as.integer
) type setting isn't necessary with data.table::frank
and there is no loss of precision from the conversion, so there are fewer ties.method
discrepancies.
Thank you to everyone who contributed! I learned a lot! Much appreciated! :)
Credit will be included in my source-code.
ENDNOTE: This question is a refined and clarified version, with easier to use and more readable example code, of More efficient method for counting open cases as of creation time of each case - I've separated it here to not overwhelm the original post with too many edits and to simplify the creation of large numbers of dataTime
pairs in the example code. That way, you don't have to work as hard to answer. Thanks again!
This will probably not answer precisely your question well as the reproducible example is not exposed to
cases_table$censored > created
condition, seemin
andmax
below. Making smaller example would help you to spot such issues. Also you should useset.seed
in your example.The
roll
join may need to be altered but I was not able to test due to mentioned issue with example data.The
which
argument is just extracting the row number from rolling join, for sorted data it also means a count of observations after the join occurs. Mentioned issue results that value to be always 1000 cause allcreated
are smaller thancensored
.For detailed description of data.table rolling joins see this post: http://gormanalysis.com/r-data-table-rolling-joins/
Once you manage to apply that solution please share your timing difference in comments.
The answer is updated with the regards to a comment by the author of the question.
I would suggest a solution using ranks. Tables are created as in a follow up to this question, or using the
dateTime
pairs generation function in the present question. Both should work.Taking difference
r1 - r2
(-1 not required with ties.method='first') gives the result (eliminating ranks ofcreated
). In terms of efficiency it takes only finding ranks of a vector of length of that number of rows incases_table
.data.table::frank
handlesPOSIXct
dateTime
objects as quickly asnumeric
objects (unlikebase::rank
), so no type conversion is required.