Binning data in R

Posted 2019-02-10 07:41

Question:

I have a vector with around 4000 values. I need to bin it into 60 equal intervals and then calculate the median of each bin.

v <- 1:4000

v is really just a vector. I read about cut, but as far as I can tell it needs me to specify the breakpoints. I just want 60 equal intervals.

Answer 1:

Use cut and tapply:

> tapply(v, cut(v, 60), median)
          (-3,67.7]          (67.7,134]           (134,201]           (201,268] 
               34.0               101.0               167.5               234.0 
          (268,334]           (334,401]           (401,468]           (468,534] 
              301.0               367.5               434.0               501.0 
          (534,601]           (601,668]           (668,734]           (734,801] 
              567.5               634.0               701.0               767.5 
          (801,867]           (867,934]         (934,1e+03]    (1e+03,1.07e+03] 
              834.0               901.0               967.5              1034.0 
(1.07e+03,1.13e+03]  (1.13e+03,1.2e+03]  (1.2e+03,1.27e+03] (1.27e+03,1.33e+03] 
             1101.0              1167.5              1234.0              1301.0 
 (1.33e+03,1.4e+03]  (1.4e+03,1.47e+03] (1.47e+03,1.53e+03]  (1.53e+03,1.6e+03] 
             1367.5              1434.0              1500.5              1567.0 
 (1.6e+03,1.67e+03] (1.67e+03,1.73e+03]  (1.73e+03,1.8e+03]  (1.8e+03,1.87e+03] 
             1634.0              1700.5              1767.0              1834.0 
(1.87e+03,1.93e+03]    (1.93e+03,2e+03]    (2e+03,2.07e+03] (2.07e+03,2.13e+03] 
             1900.5              1967.0              2034.0              2100.5 
 (2.13e+03,2.2e+03]  (2.2e+03,2.27e+03] (2.27e+03,2.33e+03]  (2.33e+03,2.4e+03] 
             2167.0              2234.0              2300.5              2367.0 
 (2.4e+03,2.47e+03] (2.47e+03,2.53e+03]  (2.53e+03,2.6e+03]  (2.6e+03,2.67e+03] 
             2434.0              2500.5              2567.0              2634.0 
(2.67e+03,2.73e+03]  (2.73e+03,2.8e+03]  (2.8e+03,2.87e+03] (2.87e+03,2.93e+03] 
             2700.5              2767.0              2833.5              2900.0 
   (2.93e+03,3e+03]    (3e+03,3.07e+03] (3.07e+03,3.13e+03]  (3.13e+03,3.2e+03] 
             2967.0              3033.5              3100.0              3167.0 
 (3.2e+03,3.27e+03] (3.27e+03,3.33e+03]  (3.33e+03,3.4e+03]  (3.4e+03,3.47e+03] 
             3233.5              3300.0              3367.0              3433.5 
(3.47e+03,3.53e+03]  (3.53e+03,3.6e+03]  (3.6e+03,3.67e+03] (3.67e+03,3.73e+03] 
             3500.0              3567.0              3633.5              3700.0 
 (3.73e+03,3.8e+03]  (3.8e+03,3.87e+03] (3.87e+03,3.93e+03]    (3.93e+03,4e+03] 
             3767.0              3833.5              3900.0              3967.0
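
As a minimal sketch of the same idea (an addition, not part of the original answer): passing a single number as breaks makes cut choose that many equal-width intervals itself, and dig.lab can be raised so the interval labels are less likely to be printed in scientific notation. The names bins, med and counts are only illustrative.

```r
v <- 1:4000

# A single number for `breaks` asks cut() for that many equal-width intervals;
# dig.lab = 4 allows four significant digits in the interval labels.
bins <- cut(v, breaks = 60, dig.lab = 4)

# Median of the values falling into each interval
med <- tapply(v, bins, median)

# Number of values per interval (equal width does not imply equal counts)
counts <- table(bins)

head(med)
range(counts)
```

If only the bin index is needed, cut also accepts labels = FALSE and then returns integer codes instead of a factor.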


Answer 2:

In the past, I've used this function:

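# Assign each element of x to one of bin.count bins of (nearly) equal size.
evenbins <- function(x, bin.count = 10, order = TRUE) {
    # Base size per bin; the first (length(x) %% bin.count) bins get one extra element
    bin.size <- rep(length(x) %/% bin.count, bin.count)
    bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
    bin <- rep(1:bin.count, bin.size)
    if (order) {
        # Map bins onto the values by rank, breaking ties at random
        bin <- bin[rank(x, ties.method = "random")]
    }
    return(factor(bin, levels = 1:bin.count, ordered = order))
}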

and then I can run it with

v.bin <- evenbins(v, 60)

and check the sizes with

table(v.bin)

and see that they all contain 66 or 67 elements. By default, this bins the values by rank, so, as with cut, successive factor levels hold increasing values. If you want to bin them based on their original order, use

v.bin <- evenbins(v, 60, order = FALSE)

instead. This just splits the data up in the order in which it appears.
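
For comparison, here is a minimal sketch of equal-count binning without a helper function, followed by the per-bin medians the question asks for. It is an alternative to evenbins, not a reimplementation of it: ties are broken by position ("first") rather than at random, and the larger bins do not necessarily come first. The names n.bins, r, bin and med are illustrative.

```r
v <- 1:4000
n.bins <- 60

# Rank of each value (1 = smallest); ties are broken by position
r <- rank(v, ties.method = "first")

# Scale ranks to (0, n.bins] and round up to get a bin index from 1 to n.bins
bin <- ceiling(r * n.bins / length(v))

# Tabulate the bin sizes; they differ by at most one element
table(table(bin))

# Median of each bin
med <- tapply(v, bin, median)
head(med)
```

Unlike equal-width intervals from cut, this keeps the number of elements per bin balanced even when the values are unevenly spread.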



Answer 3:

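This result shows the 59 midpoints between consecutive break points. Note that seq(1, 4000, length = 60) returns 60 break points, which delimit 59 intervals; 61 break points would be needed for 60 bins. The intervals are probably as close to equal as possible (but probably not exactly equal).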

> sq <- seq(1, 4000, length = 60)
> sapply(2:length(sq), function(i) median(c(sq[i-1], sq[i])))
# [1]   34.88983  102.66949  170.44915  238.22881  306.00847  373.78814
# [7]  441.56780  509.34746  577.12712  644.90678  712.68644  780.46610
#  ......

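Actually, after checking, the bins are pretty darn close to being equal; the widths differ only by floating-point rounding, which is why unique returns three values that print identically.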

> unique(diff(sq))
# [1] 67.77966 67.77966 67.77966
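
If 60 bins are wanted from explicit break points, one possible sketch (an addition here, not from the answer) is to generate 61 equally spaced break points and pass them to cut. include.lowest = TRUE is needed so that the minimum value, which coincides with the first break point, is not dropped. The names brk, bins and med are illustrative.

```r
v <- 1:4000

# 61 break points delimit 60 intervals
brk <- seq(min(v), max(v), length.out = 61)

# Intervals are right-closed; include.lowest = TRUE keeps min(v) in the first one
bins <- cut(v, breaks = brk, include.lowest = TRUE)

# Median of the data in each interval (not the midpoint of the break points)
med <- tapply(v, bins, median)

# Interval widths agree up to floating-point tolerance
all.equal(diff(brk), rep(diff(range(v)) / 60, 60))
```

Comparing the widths with all.equal rather than unique avoids being misled by differences in the last few bits of the floating-point values.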


Tags: r binning