I want to partition a vector (length around 10^5) into five classes. With the function classIntervals from package classInt I wanted to use style = "jenks" natural breaks, but this takes an inordinate amount of time even for a much smaller vector of only 500. Setting style = "kmeans" executes almost instantaneously.
library(classInt)
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)
system.time(classIntervals(x, n = 5, style = "jenks"))
#    user  system elapsed
#   13.46    0.00   13.45
system.time(classIntervals(x, n = 5, style = "kmeans"))
#    user  system elapsed
#    0.02    0.00    0.02
What makes the Jenks algorithm so slow, and is there a faster way to run it?
If need be I will move the last two parts of the question to stats.stackexchange.com:
- Under what circumstances is kmeans a reasonable substitute for Jenks?
- Is it reasonable to define classes by running classInt on a random 1% subset of the data points?
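To make the last point concrete, here is a minimal sketch of the subsetting idea, not a validated procedure. It assumes simulated data in place of the real vector, and the names x_full, idx, brks and cls are illustrative only. The outer breaks are widened to the range of the full vector, which is a choice made for this sketch so that values outside the subset's range still fall into a class.

```r
library(classInt)

set.seed(1)
# Simulated stand-in for the real data: 10^5 values from five normal components
x_full <- as.vector(mapply(rnorm, n = 2e4, mean = (1:5) * 5))

# Draw a random 1% subset and compute Jenks breaks on that subset only
idx <- sample(length(x_full), size = round(0.01 * length(x_full)))
ci <- classIntervals(x_full[idx], n = 5, style = "jenks")

# Widen the outer breaks so values outside the subset's range are still classified
# (an assumption of this sketch, not something classInt does for you)
brks <- ci$brks
brks[1] <- min(x_full)
brks[length(brks)] <- max(x_full)

# Assign every value of the full vector to one of the five classes
cls <- cut(x_full, breaks = brks, include.lowest = TRUE, labels = FALSE)
table(cls)
```

Whether breaks estimated this way are close enough to those from the full vector is exactly what the question asks, so the sketch only shows the mechanics.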
To answer your original question:
Indeed, there is now a faster way to apply the Jenks algorithm: the getJenksBreaks function in the BAMMtools package.
However, be aware that you have to set the number of breaks differently: if you set the breaks to 5 in the classIntervals function of the classInt package, you have to set the breaks to 6 in the getJenksBreaks function of the BAMMtools package to get the same results.
The speed-up is huge, i.e.