I have a data.table and would like to compute stats by groups.
R) set.seed(1)
R) DT=data.table(a=rnorm(100),b=rnorm(100))
Those groups should be defined by
R) quantile(DT$a,probs=seq(.1,.9,.1))
10% 20% 30% 40% 50% 60% 70% 80% 90%
-1.05265747329 -0.61386923071 -0.37534201964 -0.07670312896 0.11390916079 0.37707993057 0.58121734252 0.77125359976 1.18106507751
How can I compute say the average of b
per bin, say if b=-.5
I am within [-0.61386923071,-0.37534201964]
so in bin 3
I do this type of thing a lot, so I wrote a pretty flexible bin_data() method for it in my R package - mltools. It's entirely
data.table
based and makes use of the new non-equi joins.To answer your specific question, set Bin1 as a column in
DT
, then group by Bin1You can do other cool stuff too
Make 10 equally-spaced bins by quantile
Make the last boundary left-closed right-open
Specify your own explicit bins (notice empty bins are returned)
Use variable size bins
How about :
To have a look at that NA (and to check the results anyway), I next did :
Aside: I've returned a
list
column (each cell is itself a vector) there to have a quick look at the values going into the bins, just to check.data.table
displays commas when printing (and shows just the first 6 items per cell), but each cell ofV3
there is actually a numeric vector.So the values outside the first and last
break
are being coded together as NA. It's not obvious to me how to tellcut
not to do that. So I just added -Inf and +Inf :That's better. Or alternatively :
That way you see what the min and max is, rather than it displaying -Inf and +Inf. Notice you need to pass
include=TRUE
tocut
otherwise 11 bins will be returned with only 1 in the first.