可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a data frame df
with rows that are duplicates for the names column but not for the values column:
name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y
I need to aggregate the duplicate names into one row, while calculating the mean over the values column. The expected output is as follows:
name value etc1 etc2
A 10 1 X
B 2 1 Y
C 45 1 Y
I have tried to use df[duplicated(df$name),]
but of course this does not give me the mean over the duplicates. I would like to use aggregate()
, but the problem is that the FUN part of this function will apply to all the other columns as well, and among other problems, it will not be able to compute char content. Since all the other columns have the same content over the "duplicates", I need them to be aggregated as is just like the name column. Any hints...?
回答1:
Here a data.table
solution. The solution is general in the sense it will work even for a data.frame with 60 columns. Since I group the data by all variables different of value( See how I create keys below)
library(data.table)
dat <- read.table(text='name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y',header=TRUE)
keys <- colnames(dat)[!grepl('value',colnames(dat))]
X <- as.data.table(dat)
X[,list(mm= mean(value)),keys]
name etc1 etc2 mm
1: A 1 X 10
2: B 1 Y 2
3: C 1 Y 45
EDIT extend to more than one value variable
In case you have more than one numeric variables on which you want to compute the mean , For example, if your data look like this
name value etc1 etc2 value1
1 A 9 1 X 2.1763485
2 A 10 1 X -0.7954326
3 A 11 1 X -0.5839844
4 B 2 1 Y -0.5188709
5 C 40 1 Y -0.8300233
6 C 50 1 Y -0.7787496
The above solution can be extended like this :
X[,lapply(.SD,mean),keys]
name etc1 etc2 value value1
1: A 1 X 10 0.2656438
2: B 1 Y 2 -0.5188709
3: C 1 Y 45 -0.8043865
This will compute the mean for all variables that don't exist in keys list.
回答2:
You can use aggregate()
function like below:
aggregate(df$value,by=list(name=df$name,etc1=df$etc1,etc2=df$etc2),data=df,FUN=mean)
回答3:
The code (written by Metrics) is almost working except in one place (.name). I slightly modified it:
sample<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L,
50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name",
"value", "etc1", "etc2"), class = "data.frame", row.names = c(NA,
-6L))
sample.m <- ddply(sample, 'name', summarize, value =mean(value), etc1=head(etc1,1), etc2=head(etc2,1))
sample.m
name value etc1 etc2
1 A 10 1 X
2 B 2 1 Y
3 C 45 1 Y
回答4:
Assuming your dataframe is df.
install.packages("plyr")
library(plyr)
df<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L,
50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name",
"value", "etc1", "etc2"), class = "data.frame", row.names = c(NA,
-6L))
df.m<-ddply(df,.(name),summarize, value=mean(value),etc1=head(etc1,1),etc2=head(etc2,1))
df.m
name value etc1 etc2
1 A 10 1 X
2 B 2 1 Y
3 C 45 1 Y