Adding a non-aggregated column to an aggregated da

Posted 2019-07-04 03:13

Is it possible to use the aggregate function to add another column from the original data frame, without actually using that column to aggregate the data?

This is a very simplified version of the data that will help illustrate my question (let's call it data):

name      result.1    result.2    replicate    day     data.for.mean
"obj.1"   1           "good"      1            1        5
"obj.1"   1           "good"      2            1        7
"obj.1"   1           "great"     1            2        6
"obj.1"   1           "good"      2            2        9
"obj.1"   2           "bad"       1            1        10
"obj.1"   2           "not good"  2            1        6
"obj.1"   2           "bad"       1            2        5
"obj.1"   2           "not good"  2            2        3

"obj.2"   1           "excellent" 1            1        14
"obj.2"   1           "good"      2            1        10
"obj.2"   1           "good"      1            2        11
"obj.2"   1           "not bad"   2            2        7
"obj.2"   2           "bad"       1            1        4
"obj.2"   2           "bad"       2            1        3
"obj.2"   2           "horrible"  1            2        2
"obj.2"   2           "dismal"    2            2        1

You'll notice that result.1 and result.2 are tied: if result.1 == 1, result.2 is a positive label (good/great/excellent/not bad), and if result.1 == 2, result.2 is a negative one (bad/not good/horrible/dismal). I need both of these columns in the aggregated data set, and it doesn't matter which value from result.2 is picked when the data is aggregated. I just need enough information to tell whether a given result.1 value means good or bad. So, for example, every row with result.1 == 2 could carry the label "dismal".

The problem is that, since result.2 uses different names to identify good/bad, I cannot use it as a column to aggregate by.

Currently my aggregate function looks like this...

aggregated.data <- aggregate(data[c("data.for.mean")],
            by=data[c("name", "result.1", "day")],
            FUN=mean)

which gives a line of output such as this...

name     result.1    day    data.for.mean
"obj.1"  1           1      6

(All of the replicates for obj.1 with result.1 == 1 on day 1 have been averaged. They had values of 5 and 7 and were the first two rows in my mock data set.)
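For anyone who wants to run this, here is a rough reproducible sketch of the mock data and the aggregate call. The values are copied from the table above; the way the data frame is built (rep() and stringsAsFactors = FALSE) is just one assumed way of typing it in.

```r
# Mock data rebuilt from the table above.
# stringsAsFactors = FALSE is an assumption; it keeps result.2 as plain text.
data <- data.frame(
  name = rep(c("obj.1", "obj.2"), each = 8),
  result.1 = rep(rep(1:2, each = 4), times = 2),
  result.2 = c("good", "good", "great", "good",
               "bad", "not good", "bad", "not good",
               "excellent", "good", "good", "not bad",
               "bad", "bad", "horrible", "dismal"),
  replicate = rep(1:2, times = 8),
  day = rep(rep(1:2, each = 2), times = 4),
  data.for.mean = c(5, 7, 6, 9, 10, 6, 5, 3, 14, 10, 11, 7, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Average data.for.mean over the replicates in each name / result.1 / day group.
aggregated.data <- aggregate(data["data.for.mean"],
                             by = data[c("name", "result.1", "day")],
                             FUN = mean)

# The row discussed above: the mean of 5 and 7 is 6.
subset(aggregated.data, name == "obj.1" & result.1 == 1 & day == 1)
```

The result has eight rows (2 names x 2 result.1 values x 2 days), and result.2 is absent because it was neither aggregated nor grouped on.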

What I would like is a line of output such as this:

name     result.1    result.2    day    data.for.mean
"obj.1"  1           "good"      1      6

Again, "good" could be replaced with "great", "not bad", "excellent", for all values which correspond to result.1's value of '1'.

What would be the best method of capturing information from result.2 and adding it to aggregated.data (the output of the aggregate function)?

Thank you.

Tags: r aggregate
2 Answers
beautiful°
#2 · 2019-07-04 03:27

Here's a solution in base R, which uses merge followed by another aggregate:

agg.2 <- merge(aggregated.data, data[,names(data) != 'data.for.mean'])
aggregate(result.2 ~ name+result.1+day+data.for.mean, data=agg.2, FUN=sample, size=1)
##    name result.1 day data.for.mean  result.2
## 1 obj.2        2   2           1.5    dismal
## 2 obj.2        2   1           3.5       bad
## 3 obj.1        2   2           4.0       bad
## 4 obj.1        1   1           6.0      good
## 5 obj.1        1   2           7.5     great
## 6 obj.1        2   1           8.0  not good
## 7 obj.2        1   2           9.0   not bad
## 8 obj.2        1   1          12.0 excellent

Here's how this works:

The merge adds in the result.2 values, but will create multiple rows where there are multiple such values. Then aggregate is used to select one of these rows.

As you say you don't care which of the relevant result.2 labels you get, I'm getting one at random with sample.

To return the first result.2 label, use head with n=1 instead:

aggregate(result.2 ~ name+result.1+day+data.for.mean, data=agg.2, FUN=head, n=1)

Similarly, to get the last such label, use tail with n=1.
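To see the two steps end to end, here is a self-contained sketch that rebuilds the question's mock data and runs the merge followed by the head variant. The data frame construction and the name final are assumptions added for illustration; the merge and aggregate calls are the ones shown above.

```r
# Mock data from the question (the construction itself is an assumption).
data <- data.frame(
  name = rep(c("obj.1", "obj.2"), each = 8),
  result.1 = rep(rep(1:2, each = 4), times = 2),
  result.2 = c("good", "good", "great", "good",
               "bad", "not good", "bad", "not good",
               "excellent", "good", "good", "not bad",
               "bad", "bad", "horrible", "dismal"),
  replicate = rep(1:2, times = 8),
  day = rep(rep(1:2, each = 2), times = 4),
  data.for.mean = c(5, 7, 6, 9, 10, 6, 5, 3, 14, 10, 11, 7, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# The question's aggregate: one mean per name / result.1 / day group.
aggregated.data <- aggregate(data["data.for.mean"],
                             by = data[c("name", "result.1", "day")],
                             FUN = mean)

# Step 1: merge on the shared columns (name, result.1, day). Every original
# row is paired with its group mean, so each group still has two rows.
agg.2 <- merge(aggregated.data, data[, names(data) != "data.for.mean"])
nrow(agg.2)  # 16

# Step 2: collapse back to one row per group, keeping one result.2 label.
final <- aggregate(result.2 ~ name + result.1 + day + data.for.mean,
                   data = agg.2, FUN = head, n = 1)
final
```

Which label head returns for a group with mixed labels depends on the row order that merge produces, so treat the chosen label as arbitrary rather than as the first one in the original data.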

#3 · 2019-07-04 03:36

How about this with dplyr:

library(dplyr)
group_by(data, name, result.1, day) %>% summarise(mean=mean(data.for.mean), result.2=result.2[1])


#Source: local data frame [8 x 5]
#Groups: name, result.1

#   name result.1 day mean  result.2
#1 obj.2        1   2  9.0      good
#2 obj.2        1   1 12.0 excellent
#3 obj.1        1   1  6.0      good
#4 obj.1        1   2  7.5     great
#5 obj.1        2   2  4.0       bad
#6 obj.1        2   1  8.0       bad
#7 obj.2        2   2  1.5  horrible
#8 obj.2        2   1  3.5       bad
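A self-contained sketch of the same idea, assuming a recent dplyr where the pipe is written %>%. The data frame construction, the use of first() in place of result.2[1], and the name result are assumptions added for illustration.

```r
library(dplyr)

# Mock data from the question (the construction itself is an assumption).
data <- data.frame(
  name = rep(c("obj.1", "obj.2"), each = 8),
  result.1 = rep(rep(1:2, each = 4), times = 2),
  result.2 = c("good", "good", "great", "good",
               "bad", "not good", "bad", "not good",
               "excellent", "good", "good", "not bad",
               "bad", "bad", "horrible", "dismal"),
  replicate = rep(1:2, times = 8),
  day = rep(rep(1:2, each = 2), times = 4),
  data.for.mean = c(5, 7, 6, 9, 10, 6, 5, 3, 14, 10, 11, 7, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# One row per name / result.1 / day group: the mean of data.for.mean plus
# the first result.2 label seen in that group (in original row order).
result <- data %>%
  group_by(name, result.1, day) %>%
  summarise(mean = mean(data.for.mean),
            result.2 = first(result.2)) %>%
  ungroup()

result
```

Because summarise works on the original row order within each group, the label kept here is the first one listed for that group in the table, for example "great" for obj.1 with result.1 == 1 on day 2.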