I am not a survey methodologist or demographer, but I am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the Healthcare Cost and Utilization Project (HCUP) National Emergency Department Sample (NEDS). As described by the Agency for Healthcare Research and Quality, it is "Discharge data for ED visits from 947 hospitals located in 30 States, approximating a 20-percent stratified sample of U.S. hospital-based EDs".
The full dataset from 2006 to 2012 consists of 198,102,435 observations. I've subsetted the data to 40,073,358 traumatic injury-related discharges with 66 variables. Running even simple survey procedures on these data takes an inordinately long time. I've tried throwing RAM at it (late 2013 Mac Pro, 3.7GHz Quad Core, 128GB (!) memory), using multicore when available, subsetting, and working with an out-of-memory DBMS like MonetDB. Design-based survey procedures still take hours. Sometimes many hours. Some modestly complex analyses take upwards of 15 hours. I am guessing that most of the computational effort is tied to what must be a humongous covariance matrix?
As one might expect, working with the raw data is orders of magnitude faster. More interestingly, depending on the procedure, with a data set this large the unadjusted estimates can be quite close to the survey results (see the examples below). The design-based results are clearly more precise and preferred, but several hours of computing time vs. seconds is a considerable cost for that added precision. It begins to look like a very long walk around the block.
Is there anyone who's had experience with this? Are there ways to optimize R survey procedures for large data sets? Perhaps make better use of parallel processing? Are Bayesian approaches using INLA or Hamiltonian methods like Stan a possible solution? Or, are some unadjusted estimates, especially for relative measures, acceptable when the survey is large and representative enough?
Here are a couple of examples of unadjusted estimates approximating survey results.
In this first example, svymean in memory took a bit less than an hour, out of memory required well over 3 hours. The direct calculation took under a second. More importantly, the point estimates (34.75 for svymean and 34.77 unadjusted) as well as the standard errors (0.0039 and 0.0037) are quite close.
# 1a. svymean in memory
svydes<- svydesign(
id = ~KEY_ED ,
strata = ~interaction(NEDS_STRATUM , YEAR), # note YEAR interaction
weights = ~DISCWT ,
nest = TRUE,
data = inj
)
system.time(meanAGE<-svymean(~age, svydes, na.rm=T))
user system elapsed
3082.131 143.628 3208.822
> meanAGE
mean SE
age 34.746 0.0039
# 1b. svymean out of memory
db_design <-
svydesign(
weight = ~discwt , # weight variable column
nest = TRUE , # whether or not PSUs are nested within strata
strata = ~interaction(neds_stratum , yr) , # stratification variable column
id = ~key_ed ,
data = "nedsinj0612" , # table name within the monet database
dbtype = "MonetDBLite" ,
dbname = "~/HCUP/HCUP NEDS/monet" # folder location
)
system.time(meanAGE<-svymean(~age, db_design, na.rm=T))
user system elapsed
11749.302 549.609 12224.233
Warning message:
'isIdCurrent' is deprecated.
Use 'dbIsValid' instead.
See help("Deprecated")
mean SE
age 34.746 0.0039
# 1.c unadjusted mean and s.e.
system.time(print(mean(inj$AGE, na.rm=T)))
[1] 34.77108
user system elapsed
0.407 0.249 0.653
sterr <- function(x) sd(x, na.rm=T)/sqrt(length(x)) # write little function for s.e.
system.time(print(sterr(inj$AGE)))
[1] 0.003706483
user system elapsed
0.257 0.139 0.394
There is a similar correspondence between the results of svymean vs mean applied to subsets of data using svyby (nearly 2 hours) vs tapply (4 seconds or so):
# 2.a svyby .. svymean
system.time(AGEbyYear<-svyby(~age, ~yr, db_design, svymean, na.rm=T, vartype = c( 'ci' , 'se' )))
user system elapsed
4600.050 376.661 6594.196
yr age se ci_l ci_u
2006 2006 33.83112 0.009939669 33.81163 33.85060
2007 2007 34.07261 0.010055909 34.05290 34.09232
2008 2008 34.57061 0.009968646 34.55107 34.59014
2009 2009 34.87537 0.010577461 34.85464 34.89610
2010 2010 35.31072 0.010465413 35.29021 35.33124
2011 2011 35.33135 0.010312395 35.31114 35.35157
2012 2012 35.30092 0.010313871 35.28071 35.32114
# 2.b tapply ... mean
system.time(print(tapply(inj$AGE, inj$YEAR, mean, na.rm=T)))
2006 2007 2008 2009 2010 2011 2012
33.86900 34.08656 34.60711 34.81538 35.27819 35.36932 35.38931
user system elapsed
3.388 1.166 4.529
system.time(print(tapply(inj$AGE, inj$YEAR, sterr)))
2006 2007 2008 2009 2010 2011 2012
0.009577755 0.009620235 0.009565588 0.009936695 0.009906659 0.010148218 0.009880995
user system elapsed
3.237 0.990 4.186
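If only point estimates are needed, the discharge weights can be applied directly without building a survey design object. The sketch below is a minimal illustration on simulated data, not the NEDS file: the data frame `toy` and its columns `age`, `yr`, and `discwt` are hypothetical stand-ins for `inj$AGE`, `inj$YEAR`, and `inj$DISCWT`, and `wmean` is a helper written for this example. It gives only the weighted point estimate; it does not produce design-based standard errors, which still depend on the strata and PSUs.

```r
# Simulated stand-in for the injury data (hypothetical values, not NEDS).
set.seed(42)
n <- 1000
toy <- data.frame(
  age    = sample(c(0:90, NA), n, replace = TRUE),
  yr     = sample(2006:2012, n, replace = TRUE),
  discwt = runif(n, min = 3, max = 7)
)

# Weighted mean that drops records with a missing value or weight.
wmean <- function(x, w) {
  ok <- !is.na(x) & !is.na(w)
  sum(w[ok] * x[ok]) / sum(w[ok])
}

# Overall weighted mean of age.
overall <- wmean(toy$age, toy$discwt)

# Weighted mean of age within each year.
by_year <- sapply(split(toy, toy$yr), function(d) wmean(d$age, d$discwt))

print(overall)
print(by_year)
```

On the real data the same helper would be called with the actual age and weight columns; whether the speed gain justifies giving up design-based standard errors is a separate question.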
The correspondence between survey and unadjusted results starts to break down with absolute counts, which require writing a small function that draws on the survey object and uses a small bit of Dr. Lumley's code to weight the counts:
# 3.a svytotal
system.time(print(svytotal(~adj_cost, svydes, na.rm=T)))
total SE
adj_cost 9.975e+10 26685092
user system elapsed
10005.837 610.701 10577.755
# 3.b "direct" calculation
SurvTot<-function(x){
N <- sum(1/svydes$prob)
m <- mean(x, na.rm = T)
total <- m * N
return(total)
}
> system.time(print(SurvTot(inj$adj_cost)))
[1] 1.18511e+11
user system elapsed
0.735 0.311 0.989
The results are much less acceptable: the direct total (1.185e+11) is about 19% higher than the svytotal estimate (9.975e+10), a gap far larger than the reported standard error. But again, 3 hours vs. 1 second is an appreciable cost for the more precise results.
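One plausible reason the shortcut in 3.b drifts from `svytotal` is that it multiplies an unweighted mean by the summed weights, whereas a weighted total sums weight times value record by record. The two agree only when the weights are unrelated to the variable being totalled. A minimal sketch on simulated data follows; the vectors `w` and `cost` are hypothetical, and the dependence between them is built in on purpose to show the effect, not estimated from NEDS.

```r
set.seed(1)
n <- 10000
w    <- runif(n, 2, 8)                       # hypothetical design weights
cost <- rlnorm(n, meanlog = 6) * (10 / w)    # cost made to depend on the weight

weighted_total <- sum(w * cost)        # weight applied to each record
shortcut_total <- sum(w) * mean(cost)  # same form as SurvTot() in 3.b

print(c(weighted = weighted_total, shortcut = shortcut_total))
print(shortcut_total / weighted_total) # ratio differs from 1 when w and cost are related
```

Missing values are a second possible source of disagreement: `SurvTot` sums the weights over all records, including those where the variable is `NA`, while the mean is taken over the non-missing records only.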
Update: 10 Feb 2016
Thanks Severin and Anthony for allowing me to borrow your synapses. Sorry for the delay in following up; it has taken a little time to try out both your suggestions.
Severin, you are right in your observation that the Revolution Analytics/MRO build is faster for some operations. It looks like it has to do with the BLAS ("Basic Linear Algebra Subprograms") library shipped with CRAN R, which is more precise but slower. So I optimized the BLAS on my machine with the proprietary (but free with Macs) Apple Accelerate vecLib, which allows multithreading (see http://blog.quadrivio.com/2015/06/improved-r-performance-with-openblas.html). This seemed to shave some time off the operations, e.g. from 3 hours for a svyby/svymean to a bit over 2 hours.
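To confirm which linear algebra library an R session is actually using, and to get a rough feel for its speed, something like the following can help. It is only a sketch: the matrix size is arbitrary, `La_library()` is available from R 3.4.0 onward (hence the guard), and the timing will vary by machine and BLAS.

```r
# Report the LAPACK version and, where available, the library file in use.
print(La_version())
if (exists("La_library")) print(La_library())

# Time a dense cross-product, an operation that R hands to the BLAS.
set.seed(1)
n <- 500
A <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(B <- crossprod(A))["elapsed"]
print(elapsed)
```

Running this before and after switching the BLAS gives a quick comparison that is independent of the survey code.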
Anthony, I had less luck with the replicate weight design approach. type="bootstrap" with replicates=20 ran for about 39 hours before I quit out; type="BRR" returned the error "Can't split with odd numbers of PSUs in a stratum", and when I set the options to small="merge", large="merge", it ran for several hours before the OS heaved a huge sigh and ran out of application memory; type="JKn" returned the error "cannot allocate vector of size 11964693.8 Gb".
Again, many thanks for your suggestions. I will, for now, resign myself to running these analyses piecemeal and over long periods of time. If I do eventually come up with a better approach, I'll post it on SO.