Fast crosstabs and stats on all pairs of variables

2019-07-11 03:57发布

问题:

I am trying to calculate a measure of association between all variables in a data.table. (This is not a stats question, but as an aside: the variables are all factors, and the measure is Cramér's V.)

Example dataset:

p = 50; n = 1e5; # actual dataset has p > 1e3, n > 1e5, much wider but barely longer
set.seed(1234)
obs <- as.data.table( 
         data.frame(
           cbind( matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE),
                         nrow=n, ncol=p/2),
                  matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE),
                         nrow=n, ncol=p/2) ),
         stringsAsFactors=TRUE ) )

I am currently using the split-apply-combine approach, which involves looping (via plyr::adply) through all pairs of indices and returning one row for each pair. (I attempted to parallelize adply but failed.)

# Calculate Cramér's V between all variables -- my kludgey approach

pairs <- t( combn(ncol(obs), 2) ) # nx2 matrix contains indices of upper triangle of df

# library('doParallel') # I tried to parallelize -- bonus points for help here (Win 7)
# cl <- makeCluster(8)
# registerDoParallel(cl)
library('plyr')
out <- adply(pairs, 1, function(ix) {
        complete_cases <- obs[,which(complete.cases(.SD)), .SDcols=ix]
        chsq <- chisq.test(x= dcast(data = obs[complete_cases, .SD, .SDcols=ix],
                                    formula = paste( names(obs)[ix], collapse='~'), 
                                    value.var = names(obs)[ix][1], # arbitrary
                                    fun.aggregate=length)[,-1, with=FALSE] )
        return(data.table(index_1 = ix[1],
                          var_1 =  names(obs)[ix][1],
                          index_2 = ix[2],
                          var_2 =  names(obs)[ix][2],
                          cramers_v = sqrt(chsq$statistic / 
                                             (sum(chsq$observed) *
                                                (pmin(nrow(chsq$observed),
                                                      ncol(chsq$observed) ) -1  ) )
                          ) ) 
        )
      })[,-1] #}, .parallel = TRUE)[,-1] # using .parallel returns Error in do.ply(i) : 
                                       # task 1 failed - "object 'obs' not found"
out <- data.table(out) # adply won't return a data.table   
# stopCluster(cl)

What are my options for speeding up this calculation? My challenge is in passing the row-wise operation on pairs into the column-wise calculations in obs. I am wondering if it is possible to generate the column pairs directly into J, but the Force is just not strong enough with this data.table padawan.

回答1:

First, I would go with 'long' data format as following:

obs[, id := 1:n]
mobs <- melt(obs, id.vars = 'id')

Next set key on data table setkeyv(mobs, 'id').

Finally, iterate through variables and do calculations on pairs:

out <- list()
for(i in 1:p) {
  vari <- paste0('X', i)
  tmp <- mobs[mobs[variable == vari]]
  nn <- tmp[!(is.na(value) | is.na(i.value)), list(i.variable = i.variable[1], nij = length(id)), keyby = list(variable, value, i.value)]
  cj <- nn[, CJ(value = value, i.value = i.value, sorted = FALSE, unique = TRUE), by = variable]
  setkeyv(cj, c('variable', 'value', 'i.value'))
  nn <- nn[cj]
  nn[is.na(nij), nij := 0]
  nn[, ni := sum(nij), by = list(variable, i.value)]
  nn[, nj := sum(nij), by = list(variable, value)]
  nn[, c('n', 'r', 'k') := list(sum(nij), length(unique(i.value)), length(unique(value))), by = variable]
  out[[i]] <- nn[, list(i.variable = vari, cramers_v = (sqrt(sum((nij - ni * nj / n) ^ 2 / (ni * nj / n)) / n[1]) / min(k[1] - 1, r[1] - 1))), by = variable]
}
out <- rbindlist(out)

So you need to iterate only once through variables. As you see I would also wouldn't use chisq.test and would write computations myself.