I have a data frame which looks like this
a b c d
1 1 1 0
1 1 1 200
1 1 1 300
1 1 2 0
1 1 2 600
1 2 3 0
1 2 3 100
1 2 3 200
1 3 1 0
I have a data frame which looks like this
a b c d
1 1 1 250
1 1 2 600
1 2 3 150
1 3 1 0
I am currently doing it {
n=nrow(subset(Wallmart, a==i & b==j & c==k ))
sum=subset(Wallmart, a==i & b==j & c==k )
#sum
sum1=append(sum1,sum(sum$d)/(n-1))
}
I would like to add the 'd' coloumn and take the average by counting the number of rows without counting 0. For example the first row is (200+300)/2 = 250. Currently I am building a list that stores the 'd' coloumn but ideally I want it in the format above. For example first row would look like
a b c d
1 1 1 250
This is a very inefficient way to do this work. The code takes a long time to run in a loop. so any help is appreciated that makes it run faster. The original data frame has about a million rows.
You may try
aggregate
:As noted by @Roland, for bigger data sets, you may try
data.table
ordplyr
instead, e.g.:Edit following updated question. If you want to calculate group-wise mean, excluding rows that are zero, you may try this:
However, because it seems that you wish to treat your zeros as missing values rather than numeric zeros, I think it would be better to convert them to
NA
when preparing your data set, before the calculations.Here is another way:
Step1: Setup data table:
Step2: Do the computation:
Note that looping is not recommended here. And best strategy is to vectorize the solution, as in the example above.
Step3: Lets test for timing:
So the computation for nearly 10M records is performed in about 0.5 sec!
Hope this helps!!
This is the
data.table
solution per your last edit.Edit #2:
@Arun suggestion to speed it up
Edit #3
@eddis suggestion