Suppose I have a factor A with 3 levels A1, A2, A3, plus NAs. Each appears in 10 cases, for a total of 40 cases. If I do
subset1 <- df[df$A=="A1",]
dim(subset1) # 20, i.e., 10 for A1 and 10 for NA's
summary(subset1$A) # both A1 and NA have non-zero counts
subset2 <- df[df$A %in% c("A1"),]
dim(subset2) # 10, as expected
summary(subset2$A) # only A1 has non-zero count
The result is the same whether the variable used for subsetting is a factor or an integer. Is this just how == (and >, <) works? Should I stick to %in% for factors and always add !is.na when using ==? Thanks!
Yes, == and %in% return different values when they encounter NA, because of how "%in%" is defined...
# Data...
x <- c("A",NA,"A")
# When NA is encountered NA is returned
# Philosophically correct - who knows if the
# missing value at NA is equal to "A"?!
x=="A"
#[1] TRUE NA TRUE
x[x=="A"]
#[1] "A" NA "A"
# When NA is encountered by %in%, FALSE is returned, rather than NA
x %in% "A"
#[1] TRUE FALSE TRUE
x[ x %in% "A" ]
#[1] "A" "A"
This is because (from the docs) %in% is a thin wrapper around match, defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
If we redefine it using match's default nomatch = NA_integer_, you will see that it behaves the same way as == here:
"%in2%" <- function(x,table) match(x, table, nomatch = NA_integer_) > 0
x %in2% "A"
#[1] TRUE NA TRUE
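Applied to the data frame in the question, the difference could look like the sketch below. The df here is reconstructed from the description (10 rows per level plus 10 NAs), and the id column is added only for illustration, so the exact layout is an assumption.

```r
# Hypothetical reconstruction of the question's data:
# 3 levels x 10 rows each, plus 10 NAs (40 rows in total)
df <- data.frame(A  = factor(rep(c("A1", "A2", "A3", NA), each = 10)),
                 id = 1:40)

# == returns NA for the missing values, and each NA in a logical
# row index produces an all-NA row, so those rows are kept too
n_eq <- nrow(df[df$A == "A1", ])                    # 20

# Guarding with !is.na() turns those NAs into FALSE
n_guard <- nrow(df[!is.na(df$A) & df$A == "A1", ])  # 10

# %in% never returns NA, so no guard is needed
n_in <- nrow(df[df$A %in% "A1", ])                  # 10
```

So either form works for equality tests; the !is.na guard is only needed with ==, <, > and similar operators, which propagate NA.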
There's a mismatch here between what you want (only the entries that match your filtering) and what R does.
When the index vector contains an NA, R still returns an element for that position, but its value is NA. The logical tests you run yield NAs, and that is where the problem comes from.
Consider these cases:
x <- 1:10
y <- x
y[4] <- NA
ix1 <- which(x < 5)
ix2 <- which(y < 5)
x[ix1]
#[1] 1 2 3 4
y[ix2]
#[1] 1 2 3
Versus:
x[x < 5]
#[1] 1 2 3 4
y[y < 5]
#[1] 1 2 3 NA
And
y < 5
#[1] TRUE TRUE TRUE NA FALSE FALSE FALSE FALSE FALSE FALSE
It is because of this behavior that I almost never use v[logicalCondition] and instead add an additional command to select the entries, e.g. ixSelect <- which(logicalCondition). If you want NAs, you can use which(logicalCondition | is.na(v)).
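A minimal sketch of both variants follows. The vector v and the name ixWithNA are made up for illustration.

```r
# Illustrative vector with one missing value
v <- c(1, 2, 3, NA, 5, 6)

# which() returns only the positions where the condition is TRUE,
# so both FALSE and NA are dropped
ixSelect <- which(v < 5)
v[ixSelect]
#[1] 1 2 3

# Keep the missing entries as well by testing for NA explicitly
ixWithNA <- which(v < 5 | is.na(v))
v[ixWithNA]
#[1] 1 2 3 NA
```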