Methods in R for large complex survey data sets?

2019-05-06 20:33发布

I am not a survey methodologist or demographer, but am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the Healthcare Cost and Utilization Project (HCUP) National Emergency Department Sample (NEDS). As described by the Agency for Healthcare Research and Quality, it is "Discharge data for ED visits from 947 hospitals located in 30 States, approximating a 20-percent stratified sample of U.S. hospital-based EDs"

The full dataset from 2006 to 2012 consists of 198,102,435 observations. I've subsetted the data to 40,073,358 traumatic injury-related discharges with 66 variables. Running even simple survey procedures on these data takes inordinately long amounts of time. I've tried throwing RAM at it (late 2013 Mac Pro, 3.7GHz Quad Core, 128GB (!) memory), using multicore when available, subsetting , working with an out-of-memory DBMS like MonetDB. Design-based survey procedures still take hours. Sometimes many hours. Some modestly complex analyses take upwards of 15 hours. I am guessing that most of the computational effort is tied to what must be a humongous covariance matrix?

As one might expect, working with the raw data is orders of magnitude faster. More interestingly, depending on the procedure, with a data set this large the unadjusted estimates can be quite close to the survey results. (See examples below) The design-based results are clearly more precise and preferred, but several hours of computing time vs seconds is a not inconsiderable cost for that added precision. It begins to look like a very long walk around the block.

Is there anyone who's had experience with this? Are there ways to optimize R survey procedures for large data sets? Perhaps make better use of parallel processing? Are Bayesian approaches using INLA or Hamiltonian methods like Stan a possible solution? Or, are some unadjusted estimates, especially for relative measures, acceptable when the survey is large and representative enough?

Here are a couple of examples of unadjusted estimates approximating survey results.

In this first example, svymean in memory took a bit less than an hour, out of memory required well over 3 hours. The direct calculation took under a second. More importantly, the point estimates (34.75 for svymean and 34.77 unadjusted) as well as the standard errors (0.0039 and 0.0037) are quite close.

    # 1a. svymean in memory 

    svydes<- svydesign(
        id = ~KEY_ED ,
        strata = ~interaction(NEDS_STRATUM , YEAR),   note YEAR interaction
        weights = ~DISCWT ,
        nest = TRUE,
        data = inj
    )

    system.time(meanAGE<-svymean(~age, svydes, na.rm=T))
         user   system  elapsed
     3082.131  143.628 3208.822 
     > meanAGE 
           mean     SE
     age 34.746 0.0039 

    # 1b. svymean out of memory
    db_design <-
        svydesign(
            weight = ~discwt ,                                   weight variable column
            nest = TRUE ,                                        whether or not psus are nested within strata
            strata = ~interaction(neds_stratum , yr) ,           stratification variable column
            id = ~key_ed ,                                          
            data = "nedsinj0612" ,                               table name within the monet database
            dbtype = "MonetDBLite" ,
            dbname = "~/HCUP/HCUP NEDS/monet"  folder location
        )

    system.time(meanAGE<-svymean(~age, db_design, na.rm=T))
          user    system   elapsed
     11749.302   549.609 12224.233
     Warning message:
     'isIdCurrent' is deprecated.
     Use 'dbIsValid' instead.
     See help("Deprecated")
           mean     SE
     age 34.746 0.0039 


    # 1.c unadjusted mean and s.e.
    system.time(print(mean(inj$AGE, na.rm=T)))
     [1] 34.77108
        user  system elapsed
       0.407   0.249   0.653
      sterr <- function(x) sd(x, na.rm=T)/sqrt(length(x))  # write little function for s.e.
     system.time(print(sterr(inj$AGE)))
     [1] 0.003706483
        user  system elapsed
       0.257   0.139   0.394 

There is a similar correspondence between the results of svymean vs mean applied to subsets of data using svyby (nearly 2 hours) vs tapply (4 seconds or so):

# 2.a svyby .. svymean
system.time(AGEbyYear<-svyby(~age, ~yr, db_design, svymean, na.rm=T, vartype = c( 'ci' , 'se' )))
     user   system  elapsed
 4600.050  376.661 6594.196 
        yr      age          se     ci_l     ci_u
 2006 2006 33.83112 0.009939669 33.81163 33.85060
 2007 2007 34.07261 0.010055909 34.05290 34.09232
 2008 2008 34.57061 0.009968646 34.55107 34.59014
 2009 2009 34.87537 0.010577461 34.85464 34.89610
 2010 2010 35.31072 0.010465413 35.29021 35.33124
 2011 2011 35.33135 0.010312395 35.31114 35.35157
 2012 2012 35.30092 0.010313871 35.28071 35.32114


# 2.b tapply ... mean
system.time(print(tapply(inj$AGE, inj$YEAR, mean, na.rm=T)))
     2006     2007     2008     2009     2010     2011     2012
 33.86900 34.08656 34.60711 34.81538 35.27819 35.36932 35.38931
    user  system elapsed
   3.388   1.166   4.529

system.time(print(tapply(inj$AGE, inj$YEAR, sterr)))
        2006        2007        2008        2009        2010        2011        2012
 0.009577755 0.009620235 0.009565588 0.009936695 0.009906659 0.010148218 0.009880995
    user  system elapsed
   3.237   0.990   4.186

The correspondence between survey and unadjusted results starts to break down with absolute counts, which requires writing a small function that appeals to the the survey object and uses a small bit some of Dr. Lumley's code to weight the counts:

# 3.a svytotal

system.time(print(svytotal(~adj_cost, svydes, na.rm=T)))
             total       SE
adj_cost 9.975e+10 26685092
     user    system   elapsed 
10005.837   610.701 10577.755 

# 3.b "direct" calculation

SurvTot<-function(x){
    N <- sum(1/svydes$prob)
    m <- mean(x, na.rm = T)
    total <- m * N
    return(total)
}

> system.time(print(SurvTot(inj$adj_cost)))
[1] 1.18511e+11
   user  system elapsed 
  0.735   0.311   0.989 

The results are much less acceptable. Though still within the margin of error established by the survey procedure. But again, 3 hours vs. 1 second is an appreciable cost for the more precise results.

Update: 10 Feb 2016

Thanks Severin and Anthony for allowing me to borrow your synapses. Sorry for the delay in following up, has taken little time to try out both your suggestions.

Severin , you are right in your observations that Revolution Analytics/MOR build is faster for some operations. Looks like it has to do with the BLAS ("Basic Linear Algebra Subprograms") library shipped with CRAN R. It is more precise, but slower. So, I optimized the BLAS on my maching with the proprietary (but free with macs) Apple Accelerate vecLib that allows multithreading (see http://blog.quadrivio.com/2015/06/improved-r-performance-with-openblas.html). This seemed to shave some time off the operations, e.g. from 3 hours for a svyby/svymean to a bit over 2 hours.

Anthony, had less luck with the replicate weight design approach. type="bootstrap" with replicates=20 ran for about 39 hours before I quit out; type="BRR" returned error "Can't split with odd numbers of PSUs in a stratum", when I set the options to small="merge", large="merge", it ran for several hours before the OS heaved a huge sigh and ran out of application memory; type="JKn" returned he error "cannot allocate vector of size 11964693.8 Gb"

Again, many thanks for your suggestions. I will for now, resign myself to running these analyses piecemeal and over long periods of time. If I do eventually come up with a better approach, I'll post on SO

1条回答
我命由我不由天
2楼-- · 2019-05-06 20:55

for huge data sets, linearized designs (svydesign) are much slower than replication designs (svrepdesign). review the weighting functions within survey::as.svrepdesign and use one of them to directly make a replication design. you cannot use linearization for this task. and you are likely better off not even using as.svrepdesign but instead using the functions within it.

for one example using cluster=, strata=, and fpc= directly into a replicate-weighted design, see

https://github.com/ajdamico/asdfree/blob/master/Censo%20Demografico/download%20and%20import.R#L405-L429

note you can also view minute-by-minute speed tests (with timestamps for each event) here http://monetdb.cwi.nl/testweb/web/eanthony/

also note that the replicates= argument is nearly 100% responsible for the speed that the design will run. so perhaps make two designs, one for coefficients (with just a couple of replicates) and another for SEs (with as many as you can tolerate). run your coefficients interactively and refine which numbers you need during the day, then leave the bigger processes that require SE calculations running overnight

查看更多
登录 后发表回答