I'm trying to create a conditional sum, in order to calculate an average. The idea is that an function (or an apply statement) checks if an certain value is true (for example x > 0), then sums all the values of x that where higher than zero. The last step would be to divide this sum by the number of instances which are greater than zero. Searching for conditonal sum(ming) didn't gave me usable information.
This is a part of the data:
> tmpData
Instrument TradeResult.Currency.
1 JPM -3
2 JPM 264
3 JPM 284
4 JPM 69
5 JPM 283
11 KFT -8
12 KFT -48
13 KFT 125
14 KFT -150
15 KFT -206
16 KFT 107
Of the functions that I've tried, the following holds the most promise:
avgProfit <- function(x) {
ifelse(x > 0,
sum(x) / length(which(x > 0)),
return(0))
}
However, the output of this function is 0:
> with(tmpData, tapply(TradeResult.Currency., Instrument, avgProfit))
JPM KFT
0 0
> avgProfit(tmpData$TradeResult.Currency.)
[1] 0
> x
[1] 1 1 2 1 2 3 3 3 4 4
(The values should be 225 for JPM (total of 900 divided by 4 instances which where greater than zero) and 116 for KFT)
Even though I calculate the sum of x (which, if I understand correctly, should be the sum of the individual values in the data.frame) in the function, the output of the variable 'x' puzzles me. I can't find where these 1,2,3 and fours are coming from.
How can I calculate an conditional sum? Besides, do I need to use an function or am I making it too complicated (perhaps there is an build-in R function for this which I overlooked?)
Any thoughts are more than welcome,
Regards,
probably it is easy way to drop unused rows first and then aggregate them:
I would probably just approach this from an iterative style. Have a local variable called 'accumulator' or whatever, loop over all the elements in the list, and have an if block sort of like
and return accumulator's value when you're done.
You were almost there, I think ifelse was the wrong direction since you want the mean, not an element-wise comparison.
You will want to consider whether you might encounter missing values, so you can handle that properly.
JPM KFT 225 116
There have been quite a few of these data aggregation / conditional analysis question lately. It's always interesting to see the different approaches used. I thought I would add in an approach using
plyr
. I like theplyr
functions because they provide a standard syntax for all functions and allow you to specify structure of both the input and the output. Here we will useddply
since we are passing in adata.frame
and want adata.frame
back out on the other side. We use thesummarise
function to calculate the mean for each instrument where the values are positive.To follow up on @Joris performance comparison,
ddply
seems to perform as well if not better than other approaches:There is a really simple and fast
data.table
approach to this:Benchmark: Using @Joris and @Chase's performance comparison, this solution is almost five times faster than the
ddply
approach and 40 times faster than theaggregate
approach.aggregate is the easiest way probably, but I don't agree with the "cleaner because you don't have to write a custom function." Readibility goes up when defining some clear functions. Especially if you need that average a couple of times in your scripts.
Aggregate is quite a bit faster than your custom function though, because you forgot about indices. You wanted to do this :
This is again faster than aggregate, due to the lack of overhead :
In most cases you can just ignore that difference. On huge datasets (n > 100,000 ) you'll start to feel it, especially if you need to do this for a whole set of variables.
EDIT : just saw that mdsummer had exactly the same solution hidden neatly between the output :-). I leave this as a timing reference.