Efficient method for counting open cases at time o

2019-02-09 18:48发布

In a large data set (~1M cases), each case has a "created" and a "censored" dateTime. I want to count the number of other cases that were open at the time each case was created. Cases are open between their "created" and "censored" dataTimes.

Several solutions work well on small datasets (<100,000 cases), but computation time grows exponentially. My estimate is that computation time increases as a function 3n^2. By n=100,000 cases, computation time is >20 mins on my server with 6 * 4GHz cores & 64GB RAM. Even with multi-core libraries, at best, I could reduce the time by a factor of 8 or 10. Not enough to handle ~1M cases.

I'm looking for a more efficient method to do this calculation. Below I have provided a function that allows you to easily create large numbers of "created" and "censored" dateTime pairs along with two solutions tried so far, using dplyr and data.table libraries. The timings are reported to the user for simplicity. You can simply change the "CASE_COUNT" variable at the top to re-execute and view times again and easily compare the timing of other solutions you might have to suggest.

I will update the original post with other solutions to give proper credit to their authors. Thanks in advance for your help with this!

# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);

# Set seed for reproducibility 
set.seed(123)

# Set number of cases & date range variables
CASE_COUNT  <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00", 
                          format="%Y-%m-%d %H:%M:%S", 
                          tz="UTC", origin="1970-01-01");
RANGE_END   <- as.POSIXct("2012-01-01 00:00:00", 
                          format="%Y-%m-%d %H:%M:%S", 
                          tz="UTC", origin="1970-01-01");

# Select which solutions you want to run in this test           
RUN_SOLUTION_1 <- TRUE;     # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE;     # data.table:foverlaps()
RUN_SOLUTION_3 <- TRUE;     # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE;     # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE;     # data.table::frank()

# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)

generate_cases_table <- function(n = CASE_COUNT, start_val=RANGE_START, end_val=RANGE_END) {
    # Measure duration between start_val & end_val
    duration <- as.numeric(difftime(end_val, start_val, unit="secs"));

    # Select random values in duration to create start_offset
    start_offset   <- runif(n, 0, duration);

    # Calculate the creation time list
    created_list  <- start_offset + start_val;

    # Calculate acceptable time range for censored values
    # since they must always be after their respective creation value
    censored_range <- as.numeric(difftime(RANGE_END, created_list, unit="secs"));

    # Select random values in duration to create end_offset
    creation_to_censored_times <- runif(n, 0, censored_range);

    censored_list <- created_list + creation_to_censored_times;

    # Create and return a data.table with creation & censor values
    # calculated from start or end with random offsets
    return_table  <- data.table(id       = 1:n,
                                created  = created_list,
                                censored = censored_list);

    return(return_table);
}

# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();

solution_1_function <- function (cases_table) { 
    # SOLUTION 1: Using dplyr::summarize:

    # Group by id to set parameters for summarize() function 
    cases_table_grouped <- group_by(cases_table, id);

    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() within summarize()

    cases_table_summary <- summarize(cases_table_grouped, 
                           open_cases_at_creation = sum((cases_table$created  < created & 
                                                         cases_table$censored > created)));
    solution_1_table <<- as.data.table(cases_table_summary, key="id");        
} # End solution_1_function

solution_2_function <- function (cases_table) {
    # SOLUTION 2: Using data.table::foverlaps:

    # Adapted from solution provided by @Davidarenburg
    # (https://stackoverflow.com/users/3001626/david-arenburg)

    # The foverlaps() solution tends to crash R with large case counts
    # I suspect it has to do with memory assignment of the very large objects
    # It maxes RAM on my system (64GB) before crashing, possibly attempting
    # to write beyond its assigned memory limits.
    # I'll submit a reproduceable bug to the data.table team since
    # foverlaps() is pretty new and known to be occasionally unstable

    if (CASE_COUNT > 50000) {
        stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
    }

    setDT(cases_table)[, created_dupe := created];
    setkey(cases_table, created, censored);

    foverlaps_table  <- foverlaps(cases_table[,c("id","created","created_dupe"), with=FALSE],
                                  cases_table[,c("id","created","censored"),    with=FALSE], 
                                  by.x=c("created","created_dupe"))[order(i.id),.N-1,by=i.id];

    foverlaps_table  <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);

    solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function

solution_3_function <- function (cases_table) {    
    # SOLUTION 3: Using data.table aggregation instead of dplyr::summarize

    # Idea suggested by @jangorecki
    # (https://stackoverflow.com/users/2490497/jangorecki)

    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() with data.table aggregation

    cases_table_aggregated <- cases_table[order(id), sum((cases_table$created  < created & 
                                                     cases_table$censored > created)),by=id];   

    solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated, open_cases_at_creation=V1), key="id");

} # End solution_3_function

solution_4_function <- function (cases_table) { 
    # SOLUTION 4: Using IRanges package

    # Adapted from solution suggested by @alexis_laz
    # (https://stackoverflow.com/users/2414948/alexis-laz)

    # The IRanges package generates ranges efficiently, intended for genome sequencing
    # but working perfectly well on this data, since POSIXct values are numeric-representable
    solution_4_table <<- data.table(id      = cases_table$id,
                     open_cases_at_creation = countOverlaps(IRanges(cases_table$created, 
                                                                    cases_table$created), 
                                                            IRanges(cases_table$created, 
                                                                    cases_table$censored))-1, key="id");

} # End solution_4_function

solution_5_function <- function (cases_table) {
    # SOLUTION 5: Using data.table::frank()

    # Adapted from solution suggested by @danas.zuokas
    # (https://stackoverflow.com/users/1249481/danas-zuokas)

    n <- CASE_COUNT;

    # For every case compute the number of other cases
    # with `created` less than `created` of other cases
    r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]), ties.method = 'first')[1:n];

    # For every case compute the number of other cases
    # with `censored` less than `created`
    r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]), ties.method = 'first')[1:n];

    solution_5_table <<- data.table(id      = cases_table$id,
                     open_cases_at_creation = r1 - r2, key="id");

} # End solution_5_function;

# Execute user specified functions;
if (RUN_SOLUTION_1)
    solution_1_timing <- system.time(solution_1_function(cases_table)); 
if (RUN_SOLUTION_2) {
    solution_2_timing <- try(system.time(solution_2_function(cases_table)));
    cases_table <- select(cases_table, -created_dupe);
}
if (RUN_SOLUTION_3)
    solution_3_timing <- system.time(solution_3_function(cases_table)); 
if (RUN_SOLUTION_4)
    solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
    solution_5_timing <- system.time(solution_5_function(cases_table));         

# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing)!="try-error") {
    same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
    same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
    same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
    same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing)!="try-error") {
    same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing)!="try-error") {
    same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing)!="try-error") {
    same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
    same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
    same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
    same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}


same_check    <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
                     same_check2_3, same_check2_4, same_check2_5, same_check3_4,
                     same_check3_5, same_check4_5);

# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
    "It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
    cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && class(solution_2_timing)!="try-error")
    cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
    cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
    cat("The IRanges solution solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
    cat("The data.table:frank() solution solution took", solution_5_timing[3], "seconds.\n\n");

The data.table::foverlaps() solution is faster for fewer cases (<5000 or so; depends on randomness in addition to n since it uses binary search to optimize). The dplyr::summarize() solution is faster for more cases (>5,000 or so). Much beyond 100,000, neither solution is viable as they're both too slow.

EDIT: Added a third solution based on the idea suggested by @jangorecki that uses data.table aggregation instead of dplyr::summarize(), and is otherwise similar to the dplyr solution. For up to around 50,000 cases, it is the fastest solution. Beyond 50,000 cases, the dplyr::summarize() solution is slightly faster, but not by much. Sadly, for 1M cases it's still not practical.

EDIT2: Added a fourth solution adapted from the solution suggested by @alexis_laz that uses the IRanges package and its countOverlaps function. It is significantly faster than the 3 other solutions. With 50,000 cases it was almost 400% faster than solutions 1 & 3.

EDIT3: Modified case generating function to properly exercise the "censored" condition. Thanks to @jangorecki for catching the limitation of the previous version.

EDIT4: Rewrote to allow user to select which solutions to execute and to use system.time() for timing comparison with garbage collection before each execution for more accurate timing (as per @jangorecki's astute observation) - Also added some condition checks for crash cases.

EDIT5: Added a fifth solution adapted from the solution suggested by @danas.zuokas using rank(). My experimentation suggests that it's always at least an order of magnitude slower than the other solutions. At 10,000 cases, it takes 44 seconds vs. 3.5 seconds for dplyr::summarize and 0.36 seconds for IRanges solutions.

FINAL EDIT: I've made slight modifications to solution 5 that was suggested by @danas.zuokas, and matching the observation by @Khashaa about types. I've set the type as.numeric in the dataTime generation function which drastically speeds up rank as it operates on integers or doubles instead of dateTime objects(increases speed of other functions too, but not as drastically). With some testing, setting ties.method='first' yields results consistent with the intent. data.table::frank is faster than both base::rank and IRanges::rank. bit64::rank is fastest, but it seems to handle ties differently than data.table::frank and I can't get it to handle them as desired. Once bit64 is loaded, it masks a large number of types and functions, changing the results of data.table::frank along the way. The specific reasons why are beyond the scope of this question.

POST END NOTE: Turns out that data.table::frank handles POSIXct dateTimes efficiently, whereas neither base::rank nor IRanges::rank seem to. As such, even the as.numeric (or as.integer) type setting isn't necessary with data.table::frank and there is no loss of precision from the conversion, so there are fewer ties.method discrepancies. Thank you to everyone who contributed! I learned a lot! Much appreciated! :) Credit will be included in my source-code.

ENDNOTE: This question is a refined and clarified version, with easier to use and more readable example code, of More efficient method for counting open cases as of creation time of each case - I've separated it here to not overwhelm the original post with too many edits and to simplify the creation of large numbers of dataTime pairs in the example code. That way, you don't have to work as hard to answer. Thanks again!

2条回答
唯我独甜
2楼-- · 2019-02-09 19:12

This will probably not answer precisely your question well as the reproducible example is not exposed to cases_table$censored > created condition, see min and max below. Making smaller example would help you to spot such issues. Also you should use set.seed in your example.

set.seed(123)
library(data.table)
CASE_COUNT  <- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
RANGE_END   <- as.POSIXct("2012-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
generate_cases_table <- function(n = CASE_COUNT, start=RANGE_START, end=RANGE_END) {
    half_duration <- as.numeric(difftime(end, start, unit="sec")) / 2;
    start_offset  <- runif(n, 0, half_duration);
    end_offset    <- runif(n, 0, half_duration);
    data.table(id       = 1:n,created  = start + start_offset,censored = end   - end_offset)
}
cases_table = generate_cases_table()

cases_table[, .(min_censored = min(censored), max_created = max(created))]
#          min_censored         max_created
#1: 2006-01-01 13:02:12 2005-12-30 04:40:49

setorder(cases_table, created)[, created_so_far := .I - 1L]
cases_table[, censored_after := cases_table[cases_table, on = c("created" = "censored"), roll = Inf, which = TRUE]]

The roll join may need to be altered but I was not able to test due to mentioned issue with example data.
The which argument is just extracting the row number from rolling join, for sorted data it also means a count of observations after the join occurs. Mentioned issue results that value to be always 1000 cause all created are smaller than censored.
For detailed description of data.table rolling joins see this post: http://gormanalysis.com/r-data-table-rolling-joins/
Once you manage to apply that solution please share your timing difference in comments.

查看更多
何必那么认真
3楼-- · 2019-02-09 19:26

The answer is updated with the regards to a comment by the author of the question.

I would suggest a solution using ranks. Tables are created as in a follow up to this question, or using the dateTime pairs generation function in the present question. Both should work.

n <- cases_table[, .N]

# For every case compute the number of other cases
# with `created` less than `creation` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
           ties.method = 'first')[1:n]

# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
           ties.method = 'first')[1:n]

Taking difference r1 - r2 (-1 not required with ties.method='first') gives the result (eliminating ranks of created). In terms of efficiency it takes only finding ranks of a vector of length of that number of rows in cases_table. data.table::frank handles POSIXct dateTime objects as quickly as numeric objects (unlike base::rank), so no type conversion is required.

查看更多
登录 后发表回答