reshape from base vs dcast from reshape2 with miss

Posted 2019-02-25 06:13

Question:

With this data frame,

df <- expand.grid(id="01", parameter=c("blood", "saliva"), visit=c("V1", "V2", "V3"))
df$value <- c(1:6)
df$sex <- rep("f", 6)
df

> df
  id parameter visit value sex
1 01     blood    V1     1   f
2 01    saliva    V1     2   f
3 01     blood    V2     3   f
4 01    saliva    V2     4   f
5 01     blood    V3     5   f
6 01    saliva    V3     6   f

When I reshape it in the "wide" format, I get identical results with both the base reshape function and the dcast function from reshape2.

reshape(df,
        timevar="visit",
        idvar=c("id", "parameter", "sex"),
        direction="wide")

  id parameter sex value.V1 value.V2 value.V3
1 01     blood   f        1        3        5
2 01    saliva   f        2        4        6


library(reshape2)
dcast(df,
      id+parameter+sex~visit,
      value.var="value")

  id parameter sex V1 V2 V3
1 01     blood   f  1  3  5
2 01    saliva   f  2  4  6

But if I add some missing values, the results differ:

df$value <- c(1,2,NA,NA,NA,NA)
df$sex <- c(NA,NA,NA,NA,NA,NA)
df

> df
  id parameter visit value sex
1 01     blood    V1     1  NA
2 01    saliva    V1     2  NA
3 01     blood    V2    NA  NA
4 01    saliva    V2    NA  NA
5 01     blood    V3    NA  NA
6 01    saliva    V3    NA  NA

With base reshape, I get only one row

reshape(df,
        timevar="visit",
        idvar=c("id", "parameter", "sex"),
        direction="wide")

  id parameter sex value.V1 value.V2 value.V3
1 01     blood  NA        1       NA       NA

With dcast, I get two rows

dcast(df,
      id+parameter+sex~visit,
      value.var="value")

  id parameter sex V1 V2 V3
1 01     blood  NA  1 NA NA
2 01    saliva  NA  2 NA NA

Is there a way to handle these missing values in the base reshape function, as I'd like to use this one?

Answer 1:

The relevant part of the reshape code would be the line:

data[, tempidname] <- interaction(data[, idvar], drop = TRUE)

Look at how interaction works:

> interaction("A", "B")
[1] A.B
Levels: A.B
> interaction("A", "B", NA)
[1] <NA>
Levels: 

But, compare what would happen if NA were retained as a level:

> interaction("A", "B", addNA(NA))
[1] A.B.NA
Levels: A.B.NA
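
To connect this to the question's data, here is a minimal sketch using two hypothetical vectors (parameter and sex, standing in for the idvar columns). It assumes reshape tells rows apart by this combined id, as the line quoted above suggests. When every combined id is NA, the rows are indistinguishable from one another, which would explain why only the first row survives.

```r
# Hypothetical stand-ins for two of the idvar columns in the question
parameter <- c("blood", "saliva")
sex <- c(NA, NA)

# An NA in any component makes the whole combined id NA
ids_plain <- interaction(parameter, sex, drop = TRUE)
all(is.na(ids_plain))
duplicated(ids_plain)   # the second NA counts as a duplicate of the first

# With NA kept as an explicit level, the combined ids stay distinct
ids_kept <- interaction(parameter, addNA(sex), drop = TRUE)
as.character(ids_kept)
duplicated(ids_kept)
```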

Thus, if you want to have the same result with base R's reshape, you need to make sure that any "idvar" columns have NA retained as a level.

Example:

df$sex <- addNA(df$sex)
reshape(df,
        timevar="visit",
        idvar=c("id", "parameter", "sex"),
        direction="wide")
#   id parameter  sex value.V1 value.V2 value.V3
# 1 01     blood <NA>        1       NA       NA
# 2 01    saliva <NA>        2       NA       NA
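
If several id columns might contain NA, the same idea can be applied to all of them at once. The following is only a sketch: the idvars vector is a name introduced here for illustration, and ifany = TRUE is used so that columns without missing values do not gain an unused NA level.

```r
# Rebuild the data frame from the question, with missing values
df <- expand.grid(id = "01", parameter = c("blood", "saliva"),
                  visit = c("V1", "V2", "V3"))
df$value <- c(1, 2, NA, NA, NA, NA)
df$sex <- NA

# idvars is a helper name for this example, not part of reshape's API
idvars <- c("id", "parameter", "sex")

# addNA() converts each column to a factor and, where NAs are present,
# adds NA as an explicit level
df[idvars] <- lapply(df[idvars], addNA, ifany = TRUE)

wide <- reshape(df,
                timevar = "visit",
                idvar = idvars,
                direction = "wide")
nrow(wide)
```

Note that this turns the id columns into factors, so convert them back afterwards if downstream code expects character or logical columns.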

Of course, the other question is how NA can be treated as an identifying variable :-)