I have a data frame a
with 4 identifying columns: A, B, C, D
. A second data frame b
, created with ddply()
, contains a summary of all the values for different D
s for every set of A,B,C
. A third data frame c
contains a subset of b
with bad values that I want to delete from a
.
Thus, I want a subset from a
, omitting all the rows identified by a combination of A,B,C
that are also present in c
. I can think of ways do this (ugly and inefficiently) in a loop, but, my DBA background encourages me to seek a solution that is a little bit more … direct.
In code:
a <- data.frame(
A=rep(c('2013-10-30', '2014-11-6'), each=16*20),
B=rep(1:8, each=2*20),
C=rep(1:4, each=20),
D=1:20
)
a$Val=rnorm(nrow(a))
library(plyr)
b <- ddply(a, ~B+C+A, summarise,
mean_Val=mean(Val))
# Some subset criteria based on AOI group values
c <- subset(b, mean_Val <= 0)
# EDIT: Delete all the rows from a for which the
# key-triplets A,B,C are present in c
for (i in 1:nrow(c)) {
c_row = c[i,]
a <- a[ which( !(a$A==c_row$A & a$B==c_row$B & a$C==c_row$C) ), ]
}
# This is the loopy type of 'solution' I didn't want to use
Please feel free also to address unclarities in my question. I'd be happy to edit if you can point me in the right direction.
If we already created 3 datasets and want to subset the first "a" based on the elements of "c/c1", one option is anti_join
from dplyr
library(dplyr)
anti_join(a, c1, by=c('A', 'B', 'C'))
Update
Or we could use a base R
option with interaction
to paste the columns of interest together in both datasets and check whether the elements of 2nd ('c') are in 1st ('a') using %in%
. The logical index can be used to subset "a".
a1 <- a[!(as.character(interaction(a[1:3], sep=".")) %in%
as.character(interaction(c[LETTERS[1:3]], sep="."))),]
Or as @David Arenburg mentioned, we may not need to create b
, or c
datasets to get the expected output. Using plyr
, create a new mean column ("mean_Val") in "a" with mutate
and subset
the rows with mean greater than 0 (mean_Val >0
)
library(plyr)
subset(ddply(a, ~B+C+A, mutate, mean_Val=mean(Val)), mean_Val>0)
Or a similar approach using dplyr
library(dplyr)
a %>%
group_by(B, C, A) %>%
mutate(mean_Val=mean(Val)) %>%
filter(mean_Val>0)
Or if we don't need the "mean" values as a column in "a", ave
from base R
could be used as well.
a[!!with(a, ave(Val, B, C, A, FUN=function(x) mean(x)>0)),]
If we need to keep the mean_Val
column (a variation proposed by @David Arenburg)
subset(transform(a, Mean_Val = ave(Val, B, C, A, FUN = mean)),
Mean_Val > 0)
data
set.seed(24)
a <- data.frame(A= sample(LETTERS[1:3], 20, replace=TRUE),
B=sample(LETTERS[1:3], 20, replace=TRUE), C=sample(LETTERS[1:3],
20, replace=TRUE), D=rnorm(20))
b <- a %>%
group_by(A, B, C) %>%
summarise(D=sum(D))
set.seed(39)
c1 <- b[sample(1:nrow(b), 6, replace=FALSE),]
Here's a possible data.table
solution which won't require creating neither b
or c
library(data.table)
as.data.table(a)[, if(mean(Val) > 0) .SD, by = list(B, C, A)]
Or similarly (If you also want the mean itself)
as.data.table(a)[, Mean_Val := mean(Val), list(B, C, A)][Mean_Val > 0]