I wish to bucket/group/bin data :
C1 C2 C3
49488.01172 0.0512 54000
268221.1563 0.0128 34399
34775.96094 0.0128 54444
13046.98047 0.07241 61000
2121699.75 0.00453 78921
71155.09375 0.0181 13794
1369809.875 0.00453 12312
750 0.2048 43451
44943.82813 0.0362 49871
85585.04688 0.0362 18947
31090.10938 0.0362 13401
68550.40625 0.0181 14345
I want to bucket it by C2 values but I wish to define the buckets e.g. <=0.005, <=.010, <=.014 etc. As you can see, the bucketing will be uneven intervals. I want the count of C1 per bucket as well as the total sum of C1 for every bucket.
I don't know where to begin as I am fairly new a user of R. Is there anyone willing to help me figure out the code or direct to me to an example that will work for my needs?
EDIT: added another column C3. I need sum of C3 per bucket as well at the same time as sum and count of C1 per bucket
From the comments, "C2" seems to be "character" column with
%
as suffix. Before, creating a group, remove the%
usingsub
, convert to "numeric" (as.numeric
). The variable "group" is created (transform(df,...)
) by using the functioncut
withbreaks
(group buckets/intervals) andlabels
(for the desired group labels) arguments. Once the group variable is created, thesum
of the "C1" by "group" and the "count" of elements within "group" can be done usingaggregate
from "base R"or you can use
data.table
.setDT
converts thedata.frame
todata.table
. Specify the "grouping" variable withby=
and summarize/create the two variables "Count" and "Sum" within thelist(
..N
gives the count of elements within each "group".Or using
dplyr
. The%>%
connect the LHS with RHS arguments and chains them together. Usegroup_by
to specify the "group" variable, and then usesummarise_each
orsummarise
to get summary count andsum
of the concerned column.summarise_each
would be useful if there are more than one column.Update
Using the new dataset
df
and you can do the
merge
as detailed above.The
dplyr
approach would be the same except specifying the additional variabledata