Suppose the data below:
GroupId <- c(1,1,1,1,2,2,2,3,3)
IndId <- c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <- c(1,2,1,2,2,2,1,2,2)
df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df
These are multi-level data, where each group (GroupId) consists of one or more individuals (IndId) having access to one or more properties (IndGroupProperty), which are unique to their respective group (i.e. property 1 belongs to group 1 and no other group). Each of these properties belongs to a type (PropertyType).
The task is to flag every row of a group with a dummy variable when each individual in that group has access to at least one type-1 property.
For our sample data, this simply is:
ValidGroup <- c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df
The first four rows are flagged with a 1, because each individual (1, 2) of group (1) has access to a type-1 property (1). The three subsequent rows belong to group (2), in which only individual (4) has access to a type-1 property (4). Thus these are not flagged (0). The last two rows also receive no flag, because group (3) consists only of a single individual (5) with access to two type-2 properties (5, 6).
I have looked into several commands: levels seems to lack group support; getGroups in the nlme package does not like the input of my real data; I guess that there might be something useful in doBy, but summaryBy does not seem to take levels as a function.
Solution EDIT: dplyr solution by Henrik wrapped into a function:
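library(dplyr)

# Flag all rows of groups in which every individual has a type-1 property.
# group, ind and type are passed as unquoted column names.
foobar <- function(object, group, ind, type){
  groupvar <- deparse(substitute(group))
  indvar <- deparse(substitute(ind))
  typevar <- deparse(substitute(type))
  # substitute() swaps the argument names for the supplied column names
  eval(substitute(
    object[, c(groupvar, indvar, typevar)] %>%   # %>% replaces the removed %.%
      group_by(group, ind) %>%
      mutate(type1 = any(type == 1)) %>%         # individual has a type-1 property
      group_by(group) %>%                        # regroup by group only
      mutate(ValidGroup = all(type1) * 1) %>%
      select(-type1)
  ))
}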
You could also try ave.
Edit: Added dplyr alternative and benchmark for data sets of different size: original data, and data that are 10 and 100 times larger than original.
First wrap up the alternatives in functions:
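A minimal sketch of what such wrappers might look like for the ave and dplyr alternatives. The function names fun_ave and fun_dplyr are hypothetical, and the column names are taken from the sample data in the question. The dplyr variant is written with explicit namespaces and without pipes, so dplyr only needs to be installed when that function is actually called.

```r
# Sample data from the question
GroupId <- c(1,1,1,1,2,2,2,3,3)
IndId <- c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <- c(1,2,1,2,2,2,1,2,2)
df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)

# Hypothetical wrapper around the ave approach (base R only)
fun_ave <- function(df) {
  # Step 1: within each (group, individual), is there any type-1 property?
  type1 <- ave(df$PropertyType, df$GroupId, df$IndId,
               FUN = function(x) any(x == 1))
  # Step 2: within each group, does every individual pass step 1?
  df$ValidGroup <- ave(type1, df$GroupId, FUN = function(x) all(x == 1))
  df
}

# Hypothetical wrapper around the dplyr approach (needs dplyr when called)
fun_dplyr <- function(df) {
  out <- dplyr::group_by(df, GroupId, IndId)
  out <- dplyr::mutate(out, type1 = any(PropertyType == 1))
  out <- dplyr::group_by(out, GroupId)   # regroup by group only
  out <- dplyr::mutate(out, ValidGroup = all(type1) * 1)
  as.data.frame(dplyr::select(dplyr::ungroup(out), -type1))
}

fun_ave(df)$ValidGroup
# 1 1 1 1 0 0 0 0 0
```

Because ave writes its results back into a numeric vector, the flag comes out as 0/1 without any extra conversion.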
Benchmarks
Original data:
On a tiny data set ave is about twice as fast as dplyr and more than 2.5 times faster than by.
Generate some larger data; 10 times the number of groups and individuals:
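One way such larger data might be generated is to stack shifted copies of the sample data. This is only a sketch under that assumption: the helper name enlarge and its argument k are hypothetical, and the ids are offset in each copy so that groups, individuals and properties stay unique.

```r
# Sample data from the question
GroupId <- c(1,1,1,1,2,2,2,3,3)
IndId <- c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <- c(1,2,1,2,2,2,1,2,2)
df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)

# Hypothetical helper: stack k copies of df with shifted ids
enlarge <- function(df, k) {
  copies <- lapply(seq_len(k) - 1, function(i) {
    out <- df
    # Offset each id column by i times its maximum so ids never collide
    out$GroupId <- df$GroupId + i * max(df$GroupId)
    out$IndId <- df$IndId + i * max(df$IndId)
    out$IndGroupProperty <- df$IndGroupProperty + i * max(df$IndGroupProperty)
    out
  })
  do.call(rbind, copies)
}

df10 <- enlarge(df, 10)    # 10 times the groups and individuals
df100 <- enlarge(df, 100)  # 100 times the groups and individuals
```

Each alternative can then be timed on df10 and df100, for example with system.time() from base R.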
dplyr is three times faster than ave and nearly 10 times faster than by.
100 times the number of groups and individuals:
ave is really losing ground now. dplyr is nearly 30 times faster than by, and more than 100 times faster than ave.
Try this:
The key is using by() to apply a function by a grouping variable, here your df$GroupId. The function to apply is an anonymous function. For each chunk (defined by the grouping variable), it creates a table of the IndId and PropertyType entries. It then looks whether "1" appears at all in the PropertyType - if not, it returns FALSE, if yes, it looks whether every IndId has at least one "1" entry (i.e., whether all entries in the "1" column of the table are >0).
We store the result of the by() call in a structure bar, which is named according to the levels in the grouping variable. This in turn allows us to roll the result back out to the original data.frame. Note how I am using as.character() here to make sure the integers are interpreted as entry names, not entry numbers. Bad Things often happen when things have names that can be interpreted as numbers.
If you really want a 0-1 result instead of TRUE-FALSE, just add an as.numeric().
EDIT. Let's turn this into a function.
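A minimal sketch of such a function, following the steps described above. The function name flag_valid_groups and its argument names are hypothetical; the columns are passed as character strings holding the column names.

```r
# Sample data from the question
GroupId <- c(1,1,1,1,2,2,2,3,3)
IndId <- c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <- c(1,2,1,2,2,2,1,2,2)
df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)

# Hypothetical function: group, ind and type are column names as strings
flag_valid_groups <- function(df, group, ind, type) {
  # One TRUE/FALSE per group, named by the levels of the grouping variable
  bar <- by(df, df[[group]], function(chunk) {
    tab <- table(chunk[[ind]], chunk[[type]])
    # No type-1 property anywhere in this group
    if (!("1" %in% colnames(tab))) return(FALSE)
    # Every individual needs at least one entry in the "1" column
    all(tab[, "1"] > 0)
  })
  flags <- setNames(as.vector(bar), names(bar))
  # Look up by name, not position, to roll the result out to every row
  unname(flags[as.character(df[[group]])])
}

df$ValidGroup <- as.numeric(
  flag_valid_groups(df, "GroupId", "IndId", "PropertyType")
)
df$ValidGroup
# 1 1 1 1 0 0 0 0 0
```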
This still requires that the target be exactly "1", but of course this could also be included in the function definition as a parameter. Just be sure to keep column names and variables that contain column names straight.