Conditional summing (R)

2020-02-05 05:03发布

I'm trying to create a conditional sum, in order to calculate an average. The idea is that an function (or an apply statement) checks if an certain value is true (for example x > 0), then sums all the values of x that where higher than zero. The last step would be to divide this sum by the number of instances which are greater than zero. Searching for conditonal sum(ming) didn't gave me usable information.

This is a part of the data:

> tmpData
   Instrument TradeResult.Currency.
1         JPM                    -3
2         JPM                   264
3         JPM                   284
4         JPM                    69
5         JPM                   283
11        KFT                    -8
12        KFT                   -48
13        KFT                   125
14        KFT                  -150
15        KFT                  -206
16        KFT                   107

Of the functions that I've tried, the following holds the most promise:

avgProfit <- function(x) {
    ifelse(x > 0,
    sum(x) / length(which(x > 0)),
    return(0))
    }

However, the output of this function is 0:

> with(tmpData, tapply(TradeResult.Currency., Instrument, avgProfit))
JPM KFT 
  0   0     
> avgProfit(tmpData$TradeResult.Currency.)
[1] 0
> x
 [1] 1 1 2 1 2 3 3 3 4 4

(The values should be 225 for JPM (total of 900 divided by 4 instances which where greater than zero) and 116 for KFT)

Even though I calculate the sum of x (which, if I understand correctly, should be the sum of the individual values in the data.frame) in the function, the output of the variable 'x' puzzles me. I can't find where these 1,2,3 and fours are coming from.

How can I calculate an conditional sum? Besides, do I need to use an function or am I making it too complicated (perhaps there is an build-in R function for this which I overlooked?)

Any thoughts are more than welcome,

Regards,

6条回答
神经病院院长
2楼-- · 2020-02-05 05:33

probably it is easy way to drop unused rows first and then aggregate them:

aggregate(TradeResult.Currency.~Instrument,
  mean,
  data=subset(tmpData,TradeResult.Currency.>0))
查看更多
成全新的幸福
3楼-- · 2020-02-05 05:33

I would probably just approach this from an iterative style. Have a local variable called 'accumulator' or whatever, loop over all the elements in the list, and have an if block sort of like

if (x[index] > 0)
    accumulator = accumulator + x[index]

and return accumulator's value when you're done.

查看更多
女痞
4楼-- · 2020-02-05 05:34

You were almost there, I think ifelse was the wrong direction since you want the mean, not an element-wise comparison.

You will want to consider whether you might encounter missing values, so you can handle that properly.

tmpData <- read.table(textConnection("  Instrument TradeResult.Currency.
1         JPM                    -3
2         JPM                   264
3         JPM                   284
4         JPM                    69
5         JPM                   283
11        KFT                    -8
12        KFT                   -48
13        KFT                   125
14        KFT                  -150
15        KFT                  -206
16        KFT                   107"))



with(tmpData, tapply(TradeResult.Currency., Instrument, function(x) mean(x[x > 0])))

JPM KFT 225 116

查看更多
家丑人穷心不美
5楼-- · 2020-02-05 05:41

There have been quite a few of these data aggregation / conditional analysis question lately. It's always interesting to see the different approaches used. I thought I would add in an approach using plyr. I like the plyr functions because they provide a standard syntax for all functions and allow you to specify structure of both the input and the output. Here we will use ddply since we are passing in a data.frame and want a data.frame back out on the other side. We use the summarise function to calculate the mean for each instrument where the values are positive.

library(plyr)
ddply(tmpData, .(instrument), summarise, avgProfit = mean(TCurr[TCurr > 0]))

To follow up on @Joris performance comparison, ddply seems to perform as well if not better than other approaches:

> tmpData <- data.frame(
+      instrument = rep(c("JPM","KFT"),each=10e6),
+      TCurr = runif(20e6,-10,100)
+  )
> 
>  system.time(
+ ddply(tmpData, .(instrument), summarise, avgProfit = mean(TCurr[TCurr > 0]))
+  )
   user  system elapsed 
   4.43    0.89    5.32 
>  
>  avgProfit <- function(x) { mean(x[x>0])}
>  
>  system.time(
+ with(tmpData,tapply(TCurr,instrument,avgProfit))
+ )
   user  system elapsed 
   7.88    0.47    8.36 
>  
> system.time(
+ aggregate(TCurr~instrument,mean,data=subset(tmpData,TCurr>0))
+ )
   user  system elapsed 
  28.29    2.35   30.65 
查看更多
何必那么认真
6楼-- · 2020-02-05 05:51

There is a really simple and fast data.table approach to this:

library(data.table)

setDT(dt)[, .(avg = mean(TradeResult.Currency.[which(TradeResult.Currency.>0 )])), by= Instrument]

#    Instrument avg
# 1:        JPM 225
# 2:        KFT 116

Benchmark: Using @Joris and @Chase's performance comparison, this solution is almost five times faster than the ddply approach and 40 times faster than the aggregate approach.

tmpData <- data.frame(
        instrument = rep(c("JPM","KFT"),each=10e6),
        TCurr = runif(20e6,-10,100))

system.time( ddply(tmpData, .(instrument), summarise, avgProfit = mean(TCurr[TCurr > 0]))  )
# user  system elapsed 
# 1.41    0.62    2.03 

system.time( setDT(tmpData)[, .(avg = mean(TCurr[which(TCurr>0 )])), by= instrument]  )
# user  system elapsed 
# 0.36    0.18    0.43

system.time( aggregate(TCurr~instrument, mean, data=subset(tmpData,TCurr>0)) )
#  user  system elapsed 
# 16.07    1.81   17.20 
查看更多
在下西门庆
7楼-- · 2020-02-05 05:55

aggregate is the easiest way probably, but I don't agree with the "cleaner because you don't have to write a custom function." Readibility goes up when defining some clear functions. Especially if you need that average a couple of times in your scripts.

Aggregate is quite a bit faster than your custom function though, because you forgot about indices. You wanted to do this :

avgProfit <- function(x){
  mean(x[x>0])
}

This is again faster than aggregate, due to the lack of overhead :

> tmpData <- data.frame(
+     instrument = rep(c("JPM","KFT"),each=10000),
+     TCurr = runif(20000,-10,100)
+ )

> system.time(
+   with(tmpData,tapply(TCurr,instrument,avgProfit)))
   user  system elapsed 
   0.02    0.00    0.02 

> system.time(
+   aggregate(TCurr~instrument,mean,data=subset(tmpData,TCurr>0)))
   user  system elapsed 
   0.09    0.00    0.10 

In most cases you can just ignore that difference. On huge datasets (n > 100,000 ) you'll start to feel it, especially if you need to do this for a whole set of variables.

EDIT : just saw that mdsummer had exactly the same solution hidden neatly between the output :-). I leave this as a timing reference.

查看更多
登录 后发表回答