This question already has an answer here:
-
Calculate group mean (or other summary stats) and assign to original data
4 answers
Using R, I would like to create a new column (MaxAct) showing the maximum numbers of a different column (ActNo) while grouping by two factors (HHID and PERID)
For example, I have this data set:
UID HHID PERID ActNo
1 1000 1 1
2 1000 1 2
3 1000 1 3
4 1000 2 1
5 1000 2 2
6 2000 1 1
7 2000 1 2
8 2000 1 3
9 2000 1 4
10 2000 2 1
11 2000 2 2
Then I want to add the new column (MaxAct) as follows:
UID HHID PERID ActNo MaxAct
1 1000 1 1 3
2 1000 1 2 3
3 1000 1 3 3
4 1000 2 1 2
5 1000 2 2 2
6 2000 1 1 4
7 2000 1 2 4
8 2000 1 3 4
9 2000 1 4 4
10 2000 2 1 2
11 2000 2 2 2
dat$MaxAct <- with(dat, ave(ActNo, HHID, PERID, FUN=max) )
For problems involving single vectors and grouping where you want the length of the result to equal the row count, ave
is your function of choice. For more complicated problems, the lapply(split(dat, fac), FUN)
approach may be needed or use do.call(rbind, by( ...))
If you have missing values:
dat$MaxAct <- with(dat, ave(ActNo, HHID, PERID, FUN=function(x) max(x, na.rm=TRUE) ) )
This is standard fare for plyr
with mutate
or transform
, base R ave
or data.table
(which might be considered a sledgehammer for a peanuts here).
The plyr
and ave
approaches has been addressed so
data.table
library(data.table)
DT <- data.table(DF)
DT[,MaxAct := max(ActNo), by = list(HHID, PERID)]
Given the size of the data the memory efficient and fast nature of data.table
is perhaps not required.
having read your previous question How to Create a Column of Ranks While Grouping in R, so we know that max(ActNo)
is simply the number of rows in each group then
DT[,MaxAct := .N, by = list(HHID, PERID)]
will work, and be marginally quicker.
There are several approaches in R to do achieve this task. For me, the easiest way to do this is to use the plyr
package
require(plyr)
ddply(dat, .(HHID, PERID), transform, MaxAct = max(ActNo))
UID HHID PERID ActNo MaxAct
1 1 1000 1 1 3
2 2 1000 1 2 3
3 3 1000 1 3 3
4 4 1000 2 1 2
5 5 1000 2 2 2
6 6 2000 1 1 4
7 7 2000 1 2 4
8 8 2000 1 3 4
9 9 2000 1 4 4
10 10 2000 2 1 2
11 11 2000 2 2 2
df <- read.table(textConnection("UID HHID PERID ActNo
1 1000 1 1
2 1000 1 2
3 1000 1 3
4 1000 2 1
5 1000 2 2
6 2000 1 1
7 2000 1 2
8 2000 1 3
9 2000 1 4
10 2000 2 1
11 2000 2 2"), header=T)
> ddply(df, .(HHID, PERID), transform, MaxAct = length(unique(ActNo)) )
UID HHID PERID ActNo MaxAct
1 1 1000 1 1 3
2 2 1000 1 2 3
3 3 1000 1 3 3
4 4 1000 2 1 2
5 5 1000 2 2 2
6 6 2000 1 1 4
7 7 2000 1 2 4
8 8 2000 1 3 4
9 9 2000 1 4 4
10 10 2000 2 1 2
11 11 2000 2 2 2