I have a data frame that is some 35,000 rows, by 7 columns. it looks like this:
head(nuc)
chr feature start end gene_id pctAT pctGC length
1 1 CDS 67000042 67000051 NM_032291 0.600000 0.400000 10
2 1 CDS 67091530 67091593 NM_032291 0.609375 0.390625 64
3 1 CDS 67098753 67098777 NM_032291 0.600000 0.400000 25
4 1 CDS 67101627 67101698 NM_032291 0.472222 0.527778 72
5 1 CDS 67105460 67105516 NM_032291 0.631579 0.368421 57
6 1 CDS 67108493 67108547 NM_032291 0.436364 0.563636 55
gene_id is a factor, that has about 3,500 unique levels. I want to, for each level of gene_id get the min(start)
, max(end)
, mean(pctAT)
, mean(pctGC)
, and sum(length)
.
I tried using lapply and do.call for this, but it's taking forever +30 minutes to run.
the code I'm using is:
nuc_prof = lapply(levels(nuc$gene_id), function(gene){
t = nuc[nuc$gene_id==gene, ]
return(list(gene_id=gene, start=min(t$start), end=max(t$end), pctGC =
mean(t$pctGC), pct = mean(t$pctAT), cdslength = sum(t$length)))
})
nuc_prof = do.call(rbind, nuc_prof)
I'm certain I'm doing something wrong to slow this down. I haven't waited for it to finish as I'm sure it can be faster. Any ideas?
Since I'm in an evangelizing mood ... here's what the fast data.table
solution would look like:
library(data.table)
dt <- data.table(nuc, key="gene_id")
dt[,list(A=min(start),
B=max(end),
C=mean(pctAT),
D=mean(pctGC),
E=sum(length)), by=key(dt)]
# gene_id A B C D E
# 1: NM_032291 67000042 67108547 0.5582567 0.4417433 283
# 2: ZZZ 67000042 67108547 0.5582567 0.4417433 283
do.call
can be extremely slow on large objects. I think this is due to how it constructs the call, but I'm not certain. A faster alternative would be the data.table
package. Or, as @Andrie suggested in a comment, use tapply
for each calculation and cbind
the results.
A note on your current implementation: rather than doing the subsetting in your function, you could use the split
function to break up your data.frame into a list of data.frames you can loop over.
g <- function(tnuc) {
list(gene_id=tnuc$gene_id[1], start=min(tnuc$start), end=max(tnuc$end),
pctGC=mean(tnuc$pctGC), pct=mean(tnuc$pctAT), cdslength=sum(tnuc$length))
}
nuc_prof <- lapply(split(nuc, nuc$gene_id), g)
As others have mentioned - do.call
has issues with large objects, and I recently discovered exactly how slow it is on large data sets. To illustrate the problem, here's a benchamark using a simple summary call with a large regression object (a cox regression using the rms-package):
> model <- cph(Surv(Time, Status == "Cardiovascular") ~
+ Group + rcs(Age, 3) + cluster(match_group),
+ data=full_df,
+ x=TRUE, y=TRUE)
> system.time(s_reg <- summary(object = model))
user system elapsed
0.00 0.02 0.03
> system.time(s_dc <- do.call(summary, list(object = model)))
user system elapsed
282.27 0.08 282.43
> nrow(full_df)
[1] 436305
While the data.table
solution is an excellent approach to above it does not contain the full functionality of the do.call
and I therefore thought that I'll share my fastDoCall
function - a modification of Hadley Wickhams suggested hack on the R-mailing list. It is available in the Gmisc-package 1.0-version (not yet released on CRAN but you can find it here). The benchmark is:
> system.time(s_fc <- fastDoCall(summary, list(object = model)))
user system elapsed
0.03 0.00 0.06
The full code for the function is as below:
fastDoCall <- function(what, args, quote = FALSE, envir = parent.frame()){
if (quote)
args <- lapply(args, enquote)
if (is.null(names(args))){
argn <- args
args <- list()
}else{
# Add all the named arguments
argn <- lapply(names(args)[names(args) != ""], as.name)
names(argn) <- names(args)[names(args) != ""]
# Add the unnamed arguments
argn <- c(argn, args[names(args) == ""])
args <- args[names(args) != ""]
}
if (class(what) == "character"){
if(is.character(what)){
fn <- strsplit(what, "[:]{2,3}")[[1]]
what <- if(length(fn)==1) {
get(fn[[1]], envir=envir, mode="function")
} else {
get(fn[[2]], envir=asNamespace(fn[[1]]), mode="function")
}
}
call <- as.call(c(list(what), argn))
}else if (class(what) == "function"){
f_name <- deparse(substitute(what))
call <- as.call(c(list(as.name(f_name)), argn))
args[[f_name]] <- what
}else if (class(what) == "name"){
call <- as.call(c(list(what, argn)))
}
eval(call,
envir = args,
enclos = envir)
}