Question:
I have a data frame which looks like this:
a b c d
1 1 1 0
1 1 1 200
1 1 1 300
1 1 2 0
1 1 2 600
1 2 3 0
1 2 3 100
1 2 3 200
1 3 1 0
I would like to turn it into a data frame which looks like this:
a b c d
1 1 1 250
1 1 2 600
1 2 3 150
1 3 1 0
Currently I am doing it inside a loop over i, j and k:

{
  # rows of the current (a = i, b = j, c = k) group
  grp <- subset(Wallmart, a == i & b == j & c == k)
  n <- nrow(grp)
  # average d over the group, discounting the single zero row
  sum1 <- append(sum1, sum(grp$d) / (n - 1))
}
I would like to add up the 'd' column and take the average, where the divisor counts the rows but not the zeros. For example, the first group gives (200 + 300) / 2 = 250.
Currently I am building a list that stores the 'd' column, but ideally I want the output in the format above. For example, the first row would look like:
a b c d
1 1 1 250
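In other words, for each (a, b, c) group the non-zero values of d are averaged. Checking the rule by hand on that first group:

d <- c(0, 200, 300)   # d values of the group a = 1, b = 1, c = 1
mean(d[d > 0])        # (200 + 300) / 2 = 250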
This is a very inefficient way to do the work, and the code takes a long time to run in a loop, so any help that makes it faster is appreciated. The original data frame has about a million rows.
Answer 1:
You may try aggregate:
aggregate(d ~ a + b + c, data = df, sum)
# a b c d
# 1 1 1 1 500
# 2 1 3 1 0
# 3 1 1 2 600
# 4 1 2 3 300
As noted by @Roland, for bigger data sets you may try data.table or dplyr instead, e.g.:
library(dplyr)
df %>%
  group_by(a, b, c) %>%
  summarise(sum_d = sum(d))
# Source: local data frame [4 x 4]
# Groups: a, b
#
# a b c sum_d
# 1 1 1 1 500
# 2 1 1 2 600
# 3 1 2 3 300
# 4 1 3 1 0
Edit, following the updated question:
If you want to calculate the group-wise mean, excluding rows that are zero, you may try this:
aggregate(d ~ a + b + c, data = df, function(x) mean(x[x > 0]))
# a b c d
# 1 1 1 1 250
# 2 1 3 1 NaN
# 3 1 1 2 600
# 4 1 2 3 150
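The NaN in the second row is just R's mean of an empty vector: once the zeros are stripped away, the a = 1, b = 3, c = 1 group has nothing left to average.

mean(numeric(0))   # NaN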
The dplyr equivalent, filtering out the zeros first:
df %>%
  filter(d != 0) %>%
  group_by(a, b, c) %>%
  summarise(mean_d = mean(d))
# a b c mean_d
# 1 1 1 1 250
# 2 1 1 2 600
# 3 1 2 3 150
However, because it seems that you wish to treat your zeros as missing values rather than numeric zeros, I think it would be better to convert them to NA when preparing your data set, before the calculations.
df$d[df$d == 0] <- NA
df %>%
  group_by(a, b, c) %>%
  summarise(mean_d = mean(d, na.rm = TRUE))
# a b c mean_d
# 1 1 1 1 250
# 2 1 1 2 600
# 3 1 2 3 150
# 4 1 3 1 NaN
Answer 2:
This is the data.table solution per your last edit. The if ... else 0 guard returns 0 for the all-zero groups, where the mean of the remaining (empty) values would otherwise be NaN.
library(data.table)
DT <- setDT(df)[, if(any(d[d > 0])) mean(d[d > 0]) else 0, by = c("a","b","c")]
# a b c V1
# 1: 1 1 1 250
# 2: 1 1 2 600
# 3: 1 2 3 150
# 4: 1 3 1 0
Edit #2: @Arun's suggestion to speed it up: compute the group means first, then replace the NaN results from the all-zero groups with 0 by reference:
setDT(df)[, mean(d[d > 0]), by = c("a","b","c")][is.nan(V1), V1 := 0]
Edit #3: @eddi's suggestion:
setDT(df)[, sum(d) / pmax(1, sum(d > 0)), by = list(a, b, c)]
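A note on why this works (assuming d is never negative, as in the sample data): the zeros add nothing to sum(d), sum(d > 0) counts exactly the non-zero rows, and pmax(1, ...) protects the all-zero groups from a 0/0. A quick check on such a group:

d <- c(0, 0)                  # an all-zero group, e.g. a = 1, b = 3, c = 1
sum(d) / pmax(1, sum(d > 0))  # 0 rather than NaN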
Answer 3:
Here is another way.
Step 1: Set up the data table:
df <- read.table(text=" a b c d
1 1 1 0
1 1 1 200
1 1 1 300
1 1 2 0
1 1 2 600
1 2 3 0
1 2 3 100
1 2 3 200
1 3 1 0",header=T)
library(data.table)
setDT(df)
setkey(df,a,b,c)
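setkey sorts the table by (a, b, c) in place and records those columns as the table's key, which is what by = key(df) in the next step refers to:

key(df)   # "a" "b" "c"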
Step 2: Do the computation:
df[,sum(d)/ifelse((cnt=length(which(d>0)))>0,cnt,1),by=key(df)]
Note that looping is not recommended here; the best strategy is to vectorize the solution, as in the example above.
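As a rough illustration of the gap (timings are machine-dependent; a minimal sketch, not a benchmark of this exact task):

x <- runif(1e6)
system.time(sum(x))                                # one vectorized call: near-instant
system.time({ s <- 0; for (v in x) s <- s + v })   # element-by-element loop: far slower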
Step 3: Let's test the timing:
> dt<-df
> for(i in 1:20) dt <- rbind(dt,dt)
> dim(dt)
[1] 9437184 4
> setkey(dt,a,b,c)
> dt[,sum(d)/ifelse((cnt=length(which(d>0)))>0,cnt,1),by=key(dt)]
a b c V1
1: 1 1 1 250
2: 1 1 2 600
3: 1 2 3 150
4: 1 3 1 0
> system.time(dt[,sum(d)/ifelse((cnt=length(which(d>0)))>0,cnt,1),by=key(dt)])
user system elapsed
0.495 0.090 0.609
So the computation for nearly 10M records is performed in about 0.5 sec!
Hope this helps!!