R How to Get the Average of One Variable based on

2019-03-28 05:34发布

If I have a series of observations with two variables X and Y, how can I get the average value of Y based on ranges of variable X?

So for example, with some data like:

df = data.frame(x=runif(50,1,100),y=runif(50,300,700))

How could I get the answer to "When X is 1-10 the average of y 332.4, when X is 11-20 the average of y is 632.3, etc...."

6条回答
We Are One
2楼-- · 2019-03-28 05:41

One way is to use cut() to create a factor from the x variable, specifying breaks every ten units. Given that factor, you can then use by() or aggregate() or ... to summarise the data frame, or rather just column y:

R> set.seed(42); DF <- data.frame(x=runif(50,1,100), y=rnorm(50,30,70))
R> summary(DF)
       x               y         
 Min.   : 1.39   Min.   :-179.5  
 1st Qu.:40.66   1st Qu.: -19.4  
 Median :64.45   Median :  39.6  
 Mean   :60.29   Mean   :  25.9  
 3rd Qu.:90.10   3rd Qu.:  74.7  
 Max.   :98.90   Max.   : 140.3  
R> DF$cx <- cut(DF$x, breaks=seq(0,100,by=10))
R> ?by
R> by(DF, DF$cx, FUN=function(z) mean(z$y))
DF$cx: (0,10]
[1] 67.8747
--------------------------------------------- 
DF$cx: (10,20]
[1] 52.9104
--------------------------------------------- 
DF$cx: (20,30]
[1] -53.8961
--------------------------------------------- 
DF$cx: (30,40]
[1] 44.1992
--------------------------------------------- 
DF$cx: (40,50]
[1] 21.7404
--------------------------------------------- 
DF$cx: (50,60]
[1] 16.2122
--------------------------------------------- 
DF$cx: (60,70]
[1] -27.0338
--------------------------------------------- 
DF$cx: (70,80]
[1] 42.283
--------------------------------------------- 
DF$cx: (80,90]
[1] 40.8042
--------------------------------------------- 
DF$cx: (90,100]
[1] 38.8917
R> 

Or using ddply():

R> library(plyr)
R> ddply(DF, .(cx), function(z) mean(z$y))
         cx       V1
1    (0,10]  67.8747
2   (10,20]  52.9104
3   (20,30] -53.8961
4   (30,40]  44.1992
5   (40,50]  21.7404
6   (50,60]  16.2122
7   (60,70] -27.0338
8   (70,80]  42.2830
9   (80,90]  40.8042
10 (90,100]  38.8917
R> 
查看更多
Lonely孤独者°
3楼-- · 2019-03-28 05:56

Use cut to form groups and tapply to summarise over them.

df$grp <- cut(df$x, seq(0, 100, 10))
with(df, tapply(y, grp, mean))

If you are a plyr fan you may prefer

library(plyr)
ddply(df, .(grp), summarise, m = mean(y))

For completeness, the aggregate version is

aggregate(y ~ grp, df, mean)
查看更多
闹够了就滚
4楼-- · 2019-03-28 05:57

I think your question is causing your answers to be too narrow. You ought to be thinking of regression methods to summarize the joint relationships of continuous variables. Plotting with scatterplots and fitting regression splines is going to do less violence to the underlying relationships than the piecewise analysis that you specified.

查看更多
一夜七次
5楼-- · 2019-03-28 05:57

You can use tapply with pretty to make the breakpoints for cut:

 tapply(df$y,cut(df$x,pretty(range(df$x),high.u.bias=0.1)),mean)
  (0,10]  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70]  (70,80] 
496.9840 510.4164 502.4092 492.5806 493.3364 549.5207 507.4511 472.3391 
 (80,90] (90,100] 
479.8795 482.6728 

aggregate can also be used:

aggregate(df$y,list(cut(df$x,pretty(range(df$x),high.u.bias=0.1))),FUN=mean)
    Group.1        x
1    (0,10] 496.9840
2   (10,20] 510.4164
3   (20,30] 502.4092
4   (30,40] 492.5806
5   (40,50] 493.3364
6   (50,60] 549.5207
7   (60,70] 507.4511
8   (70,80] 472.3391
9   (80,90] 479.8795
10 (90,100] 482.6728
查看更多
爷、活的狠高调
6楼-- · 2019-03-28 05:59

Here is the data.table solution

require(data.table)
data.table(df)[,list(mean_y = mean(y)), by = 'cut(x, seq(0, 100, 10))']
查看更多
来,给爷笑一个
7楼-- · 2019-03-28 06:05

Cut your x using cut and then use ddply in package plyr:

> df$xrange <- cut(df$x, breaks=seq(0, 100, 10))

library(plyr)
ddply(df, .(xrange), summarize, mean_y=mean(y))
     xrange   mean_y
1    (0,10] 490.7571
2   (10,20] 462.6347
3   (20,30] 507.5614
4   (30,40] 482.6004
5   (40,50] 510.3081
6   (50,60] 480.7927
7   (60,70] 507.8944
8   (70,80] 458.4668
9   (80,90] 501.9672
10 (90,100] 493.4844
查看更多
登录 后发表回答