I am looking for a faster alternative to R's hist(x, breaks=XXX, plot=FALSE)$count,
as I don't need any of the other output it produces. I want to use it in an sapply
call that would run this function 1 million times, e.g.
x = runif(100000000, 2.5, 2.6)
bincounts = hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count
Any thoughts?
A first attempt using table and cut:
table(cut(x, breaks=seq(0,3,length.out=100)))
It avoids the extra output, but takes about 34 seconds on my computer:
system.time(table(cut(x, breaks=seq(0,3,length.out=100))))
user system elapsed
34.148 0.532 34.696
compared to 3.5 seconds for hist:
system.time(hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count)
user system elapsed
3.448 0.156 3.605
Using tabulate and .bincode runs a little bit faster than hist:
tabulate(.bincode(x, breaks=seq(0,3,length.out=100)), nbins=100)
system.time(tabulate(.bincode(x, breaks=seq(0,3,length.out=100)), nbins=100))
user system elapsed
3.084 0.024 3.107
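Since the goal is to call this inside sapply many times, it may help to wrap the tabulate/.bincode idea in a small function and build the breaks once, outside the loop. The following is only a sketch: count_bins and samples are illustrative names that do not appear in the posts above, and the right/include.lowest settings are chosen on the assumption that you want to mirror hist's defaults.

```r
# Illustrative helper (hypothetical name): count how many values of x fall
# into each interval defined by a fixed vector of breaks.
count_bins <- function(x, breaks) {
  # .bincode() returns NA for values outside the breaks; tabulate() ignores NA.
  codes <- .bincode(x, breaks, right = TRUE, include.lowest = TRUE)
  # length(breaks) - 1 intervals, so the result has one count per interval.
  tabulate(codes, nbins = length(breaks) - 1L)
}

set.seed(1)
breaks  <- seq(0, 3, length.out = 100)  # computed once, not per iteration
samples <- replicate(5, runif(1000, 2.5, 2.6), simplify = FALSE)

# One column of bin counts per sample.
counts <- sapply(samples, count_bins, breaks = breaks)
dim(counts)
```

Unlike hist, this silently drops values that fall outside the breaks instead of raising an error, so it is worth checking the range of x separately if that matters.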
Using tabulate and findInterval provides a significant performance boost relative to table and cut, and a modest improvement relative to hist:
tabulate(findInterval(x, vec=seq(0,3,length.out=100)), nbins=100)
system.time(tabulate(findInterval(x, vec=seq(0,3,length.out=100)), nbins=100))
user system elapsed
2.044 0.012 2.055
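One thing to watch for: by default findInterval uses intervals that are closed on the left, whereas hist by default uses intervals closed on the right, so values lying exactly on a break can land in different bins. The small example below (with made-up data, chosen so that every value sits on a break) shows the difference and how the left.open and rightmost.closed arguments of findInterval can be used to line the two up. For continuous data such as runif output this is unlikely to matter in practice.

```r
breaks <- c(0, 1, 2, 3)
x <- c(1, 2)  # both values sit exactly on a break

# hist() defaults: intervals (0,1], (1,2], (2,3], with the first one closed at 0
hist_counts <- hist(x, breaks = breaks, plot = FALSE)$counts

# findInterval() defaults: intervals [0,1), [1,2), [2,3)
default_counts <- tabulate(findInterval(x, breaks),
                           nbins = length(breaks) - 1L)

# left.open = TRUE switches to (a, b] intervals, matching hist()
matched_counts <- tabulate(
  findInterval(x, breaks, left.open = TRUE, rightmost.closed = TRUE),
  nbins = length(breaks) - 1L
)

hist_counts     # 1 1 0
default_counts  # 0 1 1
matched_counts  # 1 1 0
```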
Seems your best bet is to just cut out all the overhead of hist.default.
nB1 <- 99
delt <- 3/nB1
fuzz <- 1e-7 * c(-delt, rep.int(delt, nB1))
breaks <- seq(0, 3, by = delt) + fuzz
.Call(graphics:::C_BinCount, x, breaks, TRUE, TRUE)
I pared down to this by running debugonce(hist.default) to get a feel for exactly how hist works (and testing with a smaller vector -- n = 100 instead of 1000000).
Comparing:
x = runif(100, 2.5, 2.6)
y1 <- .Call(graphics:::C_BinCount, x, breaks, TRUE, TRUE)  # breaks already includes fuzz
y2 <- hist(x, breaks=seq(0,3,length.out=100), plot=FALSE)$count
identical(y1, y2)
# [1] TRUE
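If you want to reuse this for other break vectors, the steps above could be wrapped up roughly as follows. This is a sketch under several assumptions: fast_hist_counts is a made-up name, the fuzz is computed the way hist.default appeared to do it for more than five breaks (1e-7 times the median bin width), and graphics:::C_BinCount is an internal, undocumented entry point that may change between R versions, so the sketch falls back to hist if the call fails.

```r
# Hypothetical wrapper around the internal bin-counting routine used by hist().
# Assumes sorted breaks with more than 5 elements, right-closed intervals and
# include.lowest = TRUE (the hist() defaults).
fast_hist_counts <- function(x, breaks) {
  nB <- length(breaks)
  # Same small "fuzz" as in the answer above, generalised to any bin widths.
  diddle <- 1e-7 * stats::median(diff(breaks))
  fuzzy_breaks <- breaks + c(-diddle, rep.int(diddle, nB - 1L))
  tryCatch(
    # Internal API: not guaranteed to exist or keep this signature.
    .Call(graphics:::C_BinCount, x, fuzzy_breaks, TRUE, TRUE),
    # Fall back to the documented route if the internal call is unavailable.
    error = function(e) hist(x, breaks = breaks, plot = FALSE)$counts
  )
}

set.seed(42)
x_small <- runif(1000, 2.5, 2.6)
brk <- seq(0, 3, length.out = 100)
identical(fast_hist_counts(x_small, brk),
          hist(x_small, breaks = brk, plot = FALSE)$counts)
```

Note that the direct call skips the checks hist.default performs, for example that every value of x lies within the breaks, so out-of-range values are simply not counted.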