Partition into classes: jenks vs kmeans

I want to partition a vector (length around 10^5) into five classes. With the function classIntervals from package classInt I wanted to use style = "jenks" natural breaks but this takes an inordinate amount of time even for a much smaller vector of only 500. Setting style = "kmeans" executes almost instantaneously.

library(classInt)

my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)

system.time(classIntervals(x, n = 5, style = "jenks"))
R> system.time(classIntervals(x, n = 5, style = "jenks"))
   user  system elapsed 
  13.46    0.00   13.45 

system.time(classIntervals(x, n = 5, style = "kmeans"))
R> system.time(classIntervals(x, n = 5, style = "kmeans"))
   user  system elapsed 
   0.02    0.00    0.02

What makes the Jenks algorithm so slow, and is there a faster way to run it?

If need be I will move the last two parts of the question to stats.stackexchange.com:

Under what circumstances is kmeans a reasonable substitute for Jenks?
Is it reasonable to define classes by running classInt on a random 1% subset of the data points?

标签： r intervals

1条回答

戒情不戒烟

2楼-- · 2019-02-12 14:17

To answer your original question:

What makes the Jenks algorithm so slow, and is there a faster way to run it?

Indeed, meanwhile there is a faster way to apply the Jenks algorithm, the setjenksBreaks function in the BAMMtools package.

However, be aware that you have to set the number of breaks differently, i.e. if you set the breaks to 5 in the the classIntervals function of the classInt package you have to set the breaks to 6 the setjenksBreaks function in the BAMMtools package to get the same results.

# Install and load library
install.packages("BAMMtools")
library(BAMMtools)

# Set up example data
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)

# Apply function
getJenksBreaks(x, 6)

The speed up is huge, i.e.

> microbenchmark( getJenksBreaks(x, 6, subset = NULL),  classIntervals(x, n = 5, style = "jenks"), unit="s", times=10)
Unit: seconds
                                      expr         min          lq        mean      median          uq         max neval cld
       getJenksBreaks(x, 6, subset = NULL) 0.002824861 0.003038748 0.003270575 0.003145692 0.003464058 0.004263771    10  a 
 classIntervals(x, n = 5, style = "jenks") 2.008109622 2.033353970 2.094278189 2.103680325 2.111840853 2.231148846    10

0人赞添加讨论(0) 举报

Partition into classes: jenks vs kmeans

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间