How to collect additional row data on binned data in R

Posted 2020-05-01 07:48

Question:

I want to sort the values of one data.frame column into predetermined bins, and then sum the values from a different column for the rows that fall into each bin. In other words, I want to bin the rows based on one value (valueX) and then, for each bin, get the sum of a second value (valueY) over all the items in that bin. Can someone help me?

My data looks like this

df =

 Item              valueX        valueY     
A                 169849631      0.9086560 
B                  27612064      0.9298379 
C                 196651878      1.6516654 
D                  33007984      1.3397873 
E                  23019448     -0.2954385 
F                  54779712     -1.6888178

My bins look like this

Bins=
start                end
     1               249982
249983               499963
499964               749945
749946               999926
999927              1249907
1249908              1499889

What I want is a data frame that looks like this (hypothetical values in the frequency and sumvalueY columns)

resultsdf=
    binstart             binend       frequency       sumvalueY 
          1               249982             0               0
     249983               499963             5             200
     499964               749945             6             400 
     749946               999926             0               0
     999927              1249907            12              30
    1249908              1499889             0               0

Here is my code (current iteration)

Start = Bins[,1]

End = Bins[,2]

myfunction <- function(Start,End) {
  sum(df$valueX >= Start & df$valueX < End, df[,2])}

Binssorted = mapply(myfunction, Start,End)

Answer 1:

There are a number of ways to do this. Here's one using the dplyr package. I've created some fake data for illustration.

library(dplyr)

# Fake data
set.seed(5) # For reproducibility
dat = data.frame(valueX = runif(1000, 1, 2e6), valueY = rnorm(1000))

Now we'll bin the data and summarise it using the pipe operator %>%, which chains functions one after the other so that all the operations run in one go.

dat %>% 
  # Bin the data
  mutate(bins = cut(valueX, seq(0, 2e6, 250000))) %>%
  # Group data into the bins we just created
  group_by(bins) %>%
  # Count the number of rows in each bin and sum all the Y values in each bin
  summarise(freq = n(),
            sumY = sum(valueY))

                bins  freq       sumY
1        (0,2.5e+05]   127   8.404150
2    (2.5e+05,5e+05]   127  14.988207
3    (5e+05,7.5e+05]   121  10.750580
4    (7.5e+05,1e+06]   134 -28.725949
5   (1e+06,1.25e+06]   106  17.377665
6 (1.25e+06,1.5e+06]   126  14.340313
7 (1.5e+06,1.75e+06]   119  -4.241991
8   (1.75e+06,2e+06]   140   9.312233
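
If the bins have to match an explicit start/end table like the one in the question, a base R variant of the question's mapply attempt may be closer to the desired output. In the original attempt, sum(condition, df[,2]) adds the count of matching rows to the total of the entire second column, rather than summing only the matching rows. The sketch below is one possible fix, not the only one. It assumes integer valueX (as in the question) and bins that are closed on both ends; the helper name summarise_bin and the rounded fake data are illustrative assumptions.

```r
# Fake data with integer valueX, similar to the question (values are made up)
set.seed(5)
dat <- data.frame(valueX = round(runif(1000, 1, 2e6)), valueY = rnorm(1000))

# Bin table copied from the question
Bins <- data.frame(
  start = c(1, 249983, 499964, 749946, 999927, 1249908),
  end   = c(249982, 499963, 749945, 999926, 1249907, 1499889)
)

# For a single bin: count the rows whose valueX lies in [start, end]
# and sum valueY over those same rows (hypothetical helper)
summarise_bin <- function(start, end) {
  in_bin <- dat$valueX >= start & dat$valueX <= end
  c(frequency = sum(in_bin), sumvalueY = sum(dat$valueY[in_bin]))
}

# mapply returns a 2 x n matrix (one column per bin); transpose to one row per bin
stats <- t(mapply(summarise_bin, Bins$start, Bins$end))

resultsdf <- data.frame(binstart = Bins$start, binend = Bins$end, stats)
resultsdf
```

Rows whose valueX lies outside every bin (here, anything above 1499889) are simply not counted, and a bin with no rows gets a frequency of 0 and a sum of 0, whereas a grouped summary by default omits empty bins.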


Tags: r binning