可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
Suppose the data below:
GroupId <- c(1,1,1,1,2,2,2,3,3)
IndId <- c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <- c(1,2,1,2,2,2,1,2,2)
df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df
These are multi-level data, where each group GroupId
consists of one or multiple individuals IndId
having access to one or more properties IndGroupProperty
, which are unique to their respective group (i.e. property 1 belongs to group 1 and no other group). These properties each belong to a type PropertyType
.
The task is to flag each row with a dummy variable where there is at least one type-1 property belonging to each individual in the group.
For our sample data, this simply is:
ValidGroup <- c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df
The first four rows are flagged with a 1, because each individual (1, 2) of group (1) has access to a type-1 property (1).
The three subsequent rows belong to group (2), in which only individual (4) has access to a type-1 property (4). Thus these are not flagged (0).
The last two rows also receives no flag. Group (3) consists only of a single individual (5) with access to two type-2 properties (5, 6).
I have looked into several commands: levels
seems to lack group support; getGroups
in the nlme
package does not like the input of my real data; I guess that there might be something useful in doBy
, but summaryBy
does not seem to take levels
as a function.
Solution EDIT: dplyr
solution by Henrik wrapped into a function:
foobar <- function(object, group, ind, type){
groupvar <- deparse(substitute(group))
indvar <- deparse(substitute(ind))
typevar <- deparse(substitute(type))
eval(substitute(
object[, c(groupvar, indvar, typevar)] %.%
group_by(group, ind) %.%
mutate(type1 = any(type == 1)) %.%
group_by(group, add = FALSE) %.%
mutate(ValidGroup = all(type1) * 1) %.%
select(-type1)
))
}
回答1:
You could also try ave
:
# for each individual within group, calculate number of 1s in PropertyType
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
# within each group, check if all v1 is 1.
# The boolean result is coerced to 1 and 0 by ave.
df$ValidGroup <- ave(v1, df$GroupId, FUN = function(x) all(x == 1))
# GroupId IndId IndGroupProperty PropertyType ValidGroup
# 1 1 1 1 1 1
# 2 1 1 2 2 1
# 3 1 2 1 1 1
# 4 1 2 2 2 1
# 5 2 3 3 2 0
# 6 2 4 3 2 0
# 7 2 4 4 1 0
# 8 3 5 5 2 0
# 9 3 5 6 2 0
Edit Added dplyr
alternative and benchmark for data sets of different size: original data, and data that are 10 and 100 times larger than original.
First wrap up the alternatives in functions:
fun_ave <- function(df){
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
df$ValidGroup <- ave(v1, list(df$GroupId), FUN = function(x) all(x == 1))
df
}
library(dplyr)
fun_dp <- function(df){
df %.%
group_by(GroupId, IndId) %.%
mutate(
type1 = any(PropertyType == 1)) %.%
group_by(GroupId, add = FALSE) %.%
mutate(
ValidGroup = all(type1) * 1) %.%
select(-type1)
}
fun_by <- function(df){
bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
foo <- table(xx$IndId,xx$PropertyType)
if ( !("1" %in% colnames(foo)) ) {
return(FALSE) # no PropertyType=1 at all in this group
} else {
return(all(foo[,"1"]>0)) # return whether all IndId have an 1 entry
}})
cbind(df,ValidGroup = as.integer(bar[as.character(df$GroupId)]))
}
Benchmarks
Original data:
microbenchmark(
fun_ave(df),
fun_dp(df),
fun_by(df))
# Unit: microseconds
# expr min lq median uq max neval
# fun_ave(df) 497.964 519.8215 538.8275 563.5355 651.535 100
# fun_dp(df) 851.861 870.6765 931.1170 968.5590 1760.360 100
# fun_by(df) 1343.743 1412.5455 1464.6225 1581.8915 12588.607 100
On a tiny data set ave
is about twice as fast as dplyr
and more than 2.5 times faster than by
.
Generate some larger data; 10 times the number of groups and individuals
GroupId <- sample(1:30, 100, replace = TRUE)
IndId <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)
microbenchmark(
fun_ave(df2),
fun_dp(df2),
fun_by(df2))
# Unit: milliseconds
# expr min lq median uq max neval
# fun_ave(df2) 2.928865 3.185259 3.270978 3.435002 5.151457 100
# fun_dp(df2) 1.079176 1.231226 1.273610 1.352866 2.717896 100
# fun_by(df2) 9.464359 9.855317 10.137180 10.484994 12.445680 100
dplyr
is three times faster than ave
and nearly 10 times faster than by
.
100 times the number of groups and individuals
GroupId <- sample(1:300, 1000, replace = TRUE)
IndId <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)
microbenchmark(
fun_ave(df2),
fun_dp(df2),
fun_by(df2))
# Unit: milliseconds
# expr min lq median uq max neval
# fun_ave(df2) 337.889895 392.983915 413.37554 441.58179 549.5516 100
# fun_dp(df2) 3.253872 3.477195 3.58173 3.73378 75.8730 100
# fun_by(df2) 92.248791 102.122733 104.09577 109.99285 186.6829 100
ave
is really loosing ground now. dplyr
is nearly 30 times faster than by
, and more than 100 times faster than ave
.
回答2:
Try this:
bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
foo <- table(xx$IndId,xx$PropertyType)
if ( !("1" %in% colnames(foo)) ) {
return(FALSE) # no PropertyType=1 at all in this group
} else {
return(all(foo[,"1"]>0)) # return whether all IndId have an 1 entry
}})
cbind(df,bar[as.character(df$GroupId)])
The key is using by()
to apply a function by a grouping variable, here your df$GroupId
. The function to apply is an anonymous function. For each chunk (defined by the grouping variable), it creates a table
of the IndId
and PropertyType
entries. It then looks whether "1" appears at all in the PropertyType
- if not, it returns FALSE
, if yes, it looks whether every IndId
has at least one "1" entry (i.e., whether all entries in the "1" column of the table
are >0).
We store the result of the by()
call in a structure bar
, which is named according to the levels in the grouping variable. This in turn allows us to roll the result back out to the original data.frame
. Note how I am using as.character()
here to make sure the integers are interpreted as entry names, not entry numbers. Bad Things often happen when things have names that can be interpreted as numbers.
If you really want a 0-1 result instead of TRUE-FALSE
, just add an as.numeric()
.
EDIT. Let's turn this into a function.
foobar <- function(object, group, ind, type) {
bar <- by(data=object,INDICES=object[,group],FUN=function(xx){
foo <- table(xx[,ind],xx[,type])
if ( !("1" %in% colnames(foo)) ) {
return(FALSE) # no PropertyType=1 at all in this group
} else {
return(all(foo[,"1"]>0)) # return whether all IndId have an 1 entry
}})
cbind(object,bar[as.character(object[,group])])
}
foobar(df,"GroupId","IndId","PropertyType")
This still requires that the target be exactly "1", but of course this could also be included in the function definition as a parameter. Just be sure to keep column names and variables that contain column names straight.