Subset by multiple conditions

2019-02-17 06:03发布

Maybe it's something basic, but I couldn't find the answer.
I have

Id Year V1  
 1 2009 33   
 1 2010 67  
 1 2011 38  
 2 2009 45  
 3 2009 65  
 3 2010 74  
 4 2009 47  
 4 2010 51  
 4 2011 14

I need to select only the rows that have the same Id but it´s in the three years 2009, 2010 and 2011.

Id Year V1  
 1 2009 33  
 1 2010 67  
 1 2011 38  
 4 2009 47  
 4 2010 51  
 4 2011 14   

I try

d1_3 <- subset(d1, Year==2009 |Year==2010 |Year==2011 )

but it doesn't work.

Can anyone provide some suggestions that how I can do this in R?

标签: r subset
4条回答
祖国的老花朵
2楼-- · 2019-02-17 06:33

This should do the job :)

library(plyr)
ds<-ddply(ds,.(Id),mutate,Nobs=length(Year))
ds[ds$Nobs == 3 & ds$Year %in% 2009:2011,]
查看更多
聊天终结者
3楼-- · 2019-02-17 06:36

I think an approach using ave is reasonable. But there are lots of ways to solve this problem. I show a few other ways using base R. Then in the last 2 examples I'll introduce the package data.table.

Again, just throwing this out there to provide some options to use different aspects of the language.

d1 <- data.frame(ID=c(1,1,1,2,3,3,4,4,4), Year=c(2009,2010,2011, 2009,2009, 2010, 2009, 2010, 2011), V1=c(33, 67, 38, 45, 65, 74, 47, 51, 14))


# long way
use_years <- as.character(2009:2011)
cnts <- table(d1[,c("ID","Year")])[,use_years]
use_id <- rownames(cnts)[rowSums(cnts)==length(use_years)]
d1[d1[,"ID"]%in%use_id,]
# 1  1 2009 33
# 2  1 2010 67
# 3  1 2011 38
# 7  4 2009 47
# 8  4 2010 51
# 9  4 2011 14

# another longish way
ind1 <- d1[,"Year"]%in%2009:2011
d1_ind <- d1[ind1,"ID"]
ind2 <- d1_ind %in% unique(d1_ind)[tabulate(d1_ind)==3]
d1[ind1,][ind2,]
#   ID Year V1
# 1  1 2009 33
# 2  1 2010 67
# 3  1 2011 38
# 7  4 2009 47
# 8  4 2010 51
# 9  4 2011 14

OK, let's try out a couple methods using data.table. One of my favorite packages of all time. Can be a little tricky at first though, so make sure your boots are on tight (Oh, yeah, it's fast!) :)

# medium way
library(data.table)
d2 <- as.data.table(d1)

d2[ID%in%d2[Year%in%2009:2011, list(logic=nrow(.SD)==3),by="ID"][(logic),ID]]
#    ID Year V1
# 1:  1 2009 33
# 2:  1 2010 67
# 3:  1 2011 38
# 4:  4 2009 47
# 5:  4 2010 51
# 6:  4 2011 14


# short way
d2[Year%in%2009:2011][ID%in%unique(ID)[table(ID)==3]]
#    ID Year V1
# 1:  1 2009 33
# 2:  1 2010 67
# 3:  1 2011 38
# 4:  4 2009 47
# 5:  4 2010 51
# 6:  4 2011 14
查看更多
何必那么认真
4楼-- · 2019-02-17 06:38

I think ave could be useful here. I call your original data frame 'df'. For each Id, check if 2009-2011 is present in Year (2009:2011 %in% x). This gives a logical vector, which can be summed. Test if the sum equals 3 (if all Years are present, the sum is 3), which results in a new logical vector, which is used to subset rows of the data frame.

df[ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3, ]
#   Id Year V1
# 1  1 2009 33
# 2  1 2010 67
# 3  1 2011 38
# 7  4 2009 47
# 8  4 2010 51
# 9  4 2011 14
查看更多
Viruses.
5楼-- · 2019-02-17 06:46

Another way of using ave

DF
##   Id Year V1
## 1  1 2009 33
## 2  1 2010 67
## 3  1 2011 38
## 4  2 2009 45
## 5  3 2009 65
## 6  3 2010 74
## 7  4 2009 47
## 8  4 2010 51
## 9  4 2011 14


DF[ave(DF$Year, DF$Id, FUN = function(x) all(2009:2011 %in% x)) == 1, ]
##   Id Year V1
## 1  1 2009 33
## 2  1 2010 67
## 3  1 2011 38
## 7  4 2009 47
## 8  4 2010 51
## 9  4 2011 14
查看更多
登录 后发表回答