When subsetting rows with a factor with equal (==)

Posted 2019-06-21 19:56

Question:

Suppose I have a factor A with 3 levels (A1, A2, A3) plus NAs. Each of the four appears in 10 cases, so there is a total of 40 cases. If I do

subset1 <- df[df$A=="A1",]  
dim(subset1)  # 20, i.e., 10 for A1 and 10 for NA's
summary(subset1$A) # both A1 and NA have non-zero counts
subset2 <- df[df$A %in% c("A1"),] 
dim(subset2)  # 10, as expected
summary(subset2$A) # only A1 has non-zero count

The result is the same whether the variable used for subsetting is a factor or an integer. Is this just how == (and >, <) works? Should I stick to %in% for factors and always add !is.na() when using ==? Thanks!

Answer 1:

Yes, == and %in% return different values when they encounter NA, because of how %in% is defined:

# Data...
x <- c("A",NA,"A")

# When NA is encountered NA is returned
# Philosophically correct - who knows if the
# missing value at NA is equal to "A"?!
x=="A"
#[1] TRUE   NA TRUE
x[x=="A"]
#[1] "A" NA  "A"

# When NA is encountered by %in%, FALSE is returned, rather than NA
x %in% "A"
#[1]  TRUE FALSE  TRUE
x[ x %in% "A" ]
#[1] "A" "A"

This is because, according to the docs, %in% is a thin wrapper around match, defined as

"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0

If we redefine it using match's default nomatch = NA_integer_, you will see that it behaves in the same way as == here:

"%in2%" <- function(x,table) match(x, table, nomatch = NA_integer_) > 0
x %in2% "A"
#[1] TRUE   NA TRUE
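
To tie this back to the question, here is a minimal sketch on a made-up data frame shaped like the one described (10 rows each of A1, A2, A3 and NA). The data frame, the `id` column and the `n_*` names are assumptions for illustration, not the asker's actual data. It compares ==, %in%, a !is.na() guard, and subset(), which treats NA in its condition as FALSE.

```r
# Hypothetical data mirroring the question: 10 rows each of A1, A2, A3 and NA
df <- data.frame(
  A  = factor(rep(c("A1", "A2", "A3", NA), each = 10)),
  id = 1:40
)

# == yields NA where A is missing, and an NA row index returns an all-NA row
n_eq <- nrow(df[df$A == "A1", ])                    # 20

# %in% yields FALSE where A is missing, so those rows are dropped
n_in <- nrow(df[df$A %in% "A1", ])                  # 10

# Guarding == with !is.na() selects the same rows as %in%
n_guard <- nrow(df[!is.na(df$A) & df$A == "A1", ])  # 10

# subset() treats NA in the condition as FALSE
n_subset <- nrow(subset(df, A == "A1"))             # 10
```

So for a single-value equality test, %in% and the !is.na() guard are interchangeable. For > and < there is no %in% equivalent, so the guard (or subset()) is the option that carries over.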


Answer 2:

There's a mismatch here between what you want (only the entries that match your filtering) and what R does.

When a logical index vector contains an NA, R still returns an element for that position, but its value is NA (for a data frame, a row filled with NAs). The logical tests that you run yield NA wherever the data is NA, which is where the problem occurs.

Consider these cases:

x <- 1:10
y <- x
y[4] <- NA
# which() returns only the positions that are TRUE, skipping NA
ix1 <- which(x < 5)
ix2 <- which(y < 5)
x[ix1]
#[1] 1 2 3 4
y[ix2]
#[1] 1 2 3

Versus:

x[x < 5]
#[1] 1 2 3 4
y[y < 5]
#[1]  1  2  3 NA

And

y < 5
# [1]  TRUE  TRUE  TRUE    NA FALSE FALSE FALSE FALSE FALSE FALSE

Because of this behavior, I almost never use v[logicalCondition] directly. Instead I add a step that computes the indices first, e.g. ixSelect <- which(logicalCondition), since which() returns only the positions that are TRUE and skips NA. If you want to keep the NAs, you can use which(logicalCondition | is.na(v)).
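
As a rough sketch of that pattern, using a made-up vector `v` and illustrative names (`cond`, `ixSelect`, `ixWithNA`) that are not from the question:

```r
v <- c(3, NA, 7, 1, NA, 9)   # hypothetical data with two missing values
cond <- v < 5                # TRUE NA FALSE TRUE NA FALSE

# which() keeps only the positions where cond is TRUE
ixSelect <- which(cond)                 # 1 4
kept <- v[ixSelect]                     # 3 1

# NA | TRUE is TRUE, so OR-ing with is.na(v) brings the NA positions back
ixWithNA <- which(cond | is.na(v))      # 1 2 4 5
kept_with_na <- v[ixWithNA]             # 3 NA 1 NA
```

The same index vectors work for rows of a data frame, e.g. df[which(df$A == "A1"), ].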



Tags: r equals subset na