reshape wide to long with character suffixes instead of numeric suffixes

Posted 2019-01-14 16:54

Question:

Inspired by a comment from @gsk3 on a question about reshaping data, I started doing a little bit of experimentation with reshaping data where the variable names have character suffixes instead of numeric suffixes.

As an example, I'll load the dadmomw dataset from one of the UCLA ATS Stata learning webpages (see "Example 4" on the webpage).

Here's what the dataset looks like:

library(foreign)
dadmom <- read.dta("https://stats.idre.ucla.edu/stat/stata/modules/dadmomw.dta")
dadmom
#   famid named  incd namem  incm
# 1     1  Bill 30000  Bess 15000
# 2     2   Art 22000   Amy 18000
# 3     3  Paul 25000   Pat 50000

When trying to reshape from this wide format to long, I run into a problem. Here's what I do to reshape the data.

reshape(dadmom, direction="long", idvar=1, varying=2:5, 
        sep="", v.names=c("name", "inc"), timevar="dadmom",
        times=c("d", "m"))
#     famid dadmom  name  inc
# 1.d     1      d 30000 Bill
# 2.d     2      d 22000  Art
# 3.d     3      d 25000 Paul
# 1.m     1      m 15000 Bess
# 2.m     2      m 18000  Amy
# 3.m     3      m 50000  Pat

Note the swapped column names for "name" and "inc"; changing v.names to c("inc", "name") doesn't solve the problem.

reshape seems very picky about wanting the columns to be named in a fairly standard way. For example, I can reshape the data correctly (and easily) if I first rename the columns:

dadmom2 <- dadmom # Just so we can continue experimenting with the original data
# Change the names of the last four variables to include a "."
names(dadmom2)[2:5] <- gsub("(d$|m$)", "\\.\\1", names(dadmom2)[2:5])
reshape(dadmom2, direction="long", idvar=1, varying=2:5, 
        timevar="dadmom")
#     famid dadmom name   inc
# 1.d     1      d Bill 30000
# 2.d     2      d  Art 22000
# 3.d     3      d Paul 25000
# 1.m     1      m Bess 15000
# 2.m     2      m  Amy 18000
# 3.m     3      m  Pat 50000

My questions are:

  1. Why is R swapping the columns in the example I've provided?
  2. Can I get to this result with base R reshape without changing the variable names before reshaping?
  3. Are there other approaches that could be considered instead of reshape?

Answer 1:

This works (the list tells varying which columns go together):

reshape(dadmom, direction="long",  varying=list(c(2, 4), c(3, 5)), 
        sep="", v.names=c("name", "inc"), timevar="dadmom",
        times=c("d", "m"))

So you actually have nested repeated measures here; both name and inc for mom and dad. Because you have more than one series of repeated measures you have to supply a list to varying that tells reshape which group gets stacked on the other group.

So the two approaches to this problem are to provide a list as I did or to rename the columns the way the R beast likes them as you did.
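As a rough illustration of why the column labels end up swapped, here is a minimal sketch. It assumes that reshape groups an atomic varying by recycling v.names across the columns and calling split() on the result, which is my reading of the base R source rather than documented behaviour. The dadmom data frame below is typed in by hand so the example runs without the download.

```r
# Hand-typed stand-in for the dadmom data (assumption: same values as the question)
dadmom <- data.frame(
  famid = 1:3,
  named = c("Bill", "Art", "Paul"),
  incd  = c(30000, 22000, 25000),
  namem = c("Bess", "Amy", "Pat"),
  incm  = c(15000, 18000, 50000),
  stringsAsFactors = FALSE
)

# Recycle the labels over the columns and split by label.
# split() orders its groups alphabetically ("inc" before "name"), so the
# first group holds the inc* columns even though v.names[1] is "name".
groups <- split(names(dadmom)[2:5], rep(c("name", "inc"), 2))
groups

# Passing the groups explicitly, in the same order as v.names, avoids the mismatch
long <- reshape(dadmom, direction = "long", idvar = "famid",
                varying = list(c("named", "namem"), c("incd", "incm")),
                v.names = c("name", "inc"), timevar = "dadmom",
                times = c("d", "m"))
long
```

If that reading is right, it would also explain why reversing v.names does not help: the labels are still matched to the alphabetically ordered groups by position.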

See my recent blog posts on base reshape for more on this (the second link in particular deals with this case):

reshape (part I)

reshape (part II)



Answer 2:

Though this question was specifically about base R, it is useful to know other approaches that help you to achieve the same type of outcome.

One alternative to reshape or merged.stack would be to use a combination of "dplyr" and "tidyr", like this:

library(dplyr)
library(tidyr)
dadmom %>%
  gather(variable, value, -famid) %>%               ## Make the entire dataset long
  separate(variable, into = c("var", "time"),       ## Split "variable" column into two...
           sep = "(?<=name|inc)", perl = TRUE) %>%  ## ... using regex to split the values
  spread(var, value, convert = TRUE)                ## Make result wide, converting type
#   famid time   inc name
# 1     1    d 30000 Bill
# 2     1    m 15000 Bess
# 3     2    d 22000  Art
# 4     2    m 18000  Amy
# 5     3    d 25000 Paul
# 6     3    m 50000  Pat
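For what it's worth, tidyr 1.0.0 and later also provide pivot_longer(), which can do this in one step. The following is only a sketch, not part of the original answer: the hand-typed dadmom data frame and the regular expression in names_pattern are my own assumptions about how the column names should be split.

```r
library(tidyr)

# Hand-typed copy of the dadmom data so the example runs without a download
dadmom <- data.frame(
  famid = 1:3,
  named = c("Bill", "Art", "Paul"),
  incd  = c(30000, 22000, 25000),
  namem = c("Bess", "Amy", "Pat"),
  incm  = c(15000, 18000, 50000),
  stringsAsFactors = FALSE
)

# ".value" tells pivot_longer() that the first regex group ("name" or "inc")
# names the output column; the second group (the suffix) goes into "time"
long <- pivot_longer(dadmom, cols = -famid,
                     names_to = c(".value", "time"),
                     names_pattern = "^(name|inc)(.)$")
long
```

Because the name and inc columns are never stacked into a single value column, each keeps its own type and no conversion step is needed afterwards.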

Another alternative would be to use melt from "data.table", like this:

library(data.table)
melt(as.data.table(dadmom),             ## melt here requires a data.table 
     measure = patterns("name", "inc"), ## identify columns by patterns
     value.name = c("name", "inc"))[    ## specify the resulting variable names
       ## melt creates a numeric "variable" value. Replace with factored labels
       , variable := factor(variable, labels = c("d", "m"))][]
#    famid variable name   inc
# 1:     1        d Bill 30000
# 2:     2        d  Art 22000
# 3:     3        d Paul 25000
# 4:     1        m Bess 15000
# 5:     2        m  Amy 18000
# 6:     3        m  Pat 50000

How do these approaches compare with merged.stack?

  • Both packages are much better supported. They update and test their code more extensively than I do.
  • melt is blazing fast.
  • The Hadleyverse approach is actually slower (in many of my tests, even slower than base R's reshape) probably because of having to make the data long, then wide, then performing type conversion. However, some users like its step-by-step approach.
  • The Hadleyverse approach might have some unintended consequences because of the requirement of making the data long before making it wide. That forces all of the measure columns to be coerced to the same type (usually "character") if they are of different types to begin with.
  • Neither has the same convenience as merged.stack. Just look at the code required to get the result ;-)

merged.stack, however, can probably benefit from a simplified update, something along the lines of this function:

ReshapeLong_ <- function(indt, stubs, sep = NULL) {
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  mv <- lapply(stubs, function(y) grep(sprintf("^%s", y), names(indt)))
  levs <- unique(gsub(paste(stubs, collapse="|"), "", names(indt)[unlist(mv)]))
  # Strip the separator literally; with fixed = TRUE a leading "^" is not treated as an anchor
  if (!is.null(sep)) levs <- sub(sep, "", levs, fixed = TRUE)
  melt(indt, measure = mv, value.name = stubs)[
    , variable := factor(variable, labels = levs)][]
}

Which can then be used as:

ReshapeLong_(dadmom, stubs = c("name", "inc"))

How do these approaches compare with base R's reshape?

  • The main difference is that reshape is not able to handle unbalanced panel datasets. See, for example, "mydf2" as opposed to "mydf" in the tests below.
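To make the unbalanced case concrete, here is a small sketch on hypothetical toy data (the wide data frame and its column names are made up for illustration). It shows reshape refusing groups of unequal length, and one possible workaround: adding the missing column as NA before reshaping.

```r
# Hypothetical toy data: "b" was measured at times 1 and 2, "a" only at time 1
wide <- data.frame(id = 1:2, a1 = c(10, 20),
                   b1 = c("x", "y"), b2 = c("p", "q"),
                   stringsAsFactors = FALSE)

# Unequal group lengths: reshape() is expected to stop with an error here
res <- tryCatch(
  reshape(wide, direction = "long", idvar = "id",
          varying = list("a1", c("b1", "b2")),
          v.names = c("a", "b"), times = 1:2),
  error = function(e) e
)
inherits(res, "error")

# One workaround: add the missing column as NA so every group has the same length
wide$a2 <- NA_real_
long <- reshape(wide, direction = "long", idvar = "id",
                varying = list(c("a1", "a2"), c("b1", "b2")),
                v.names = c("a", "b"), times = 1:2)
long
```

Padding works, but it has to be done by hand for every missing column, which is the inconvenience the other approaches avoid.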

Test cases

Here's some sample data. "mydf" is balanced. "mydf2" is not balanced.

set.seed(1)
x <- 10000
mydf <- mydf2 <- data.frame(
  id_1 = 1:x, id_2 = c("A", "B"), varAa = sample(letters, x, TRUE), 
  varAb = sample(letters, x, TRUE), varAc = sample(letters, x, TRUE),
  varBa = sample(10, x, TRUE), varBb = sample(10, x, TRUE), 
  varBc = sample(10, x, TRUE), varCa = rnorm(x), varCb = rnorm(x), 
  varCc = rnorm(x), varDa = rnorm(x), varDb = rnorm(x), varDc = rnorm(x))

mydf2 <- mydf2[-c(9, 14)] ## Make data unbalanced

Here are some functions to test:

f1 <- function(mydf) {
  mydf %>%
    gather(variable, value, starts_with("var")) %>%
    separate(variable, into = c("var", "time"),
             sep = "(?<=varA|varB|varC|varD)", perl = TRUE) %>%
    spread(var, value, convert = TRUE) 
}

f2 <- function(mydf) {
  melt(as.data.table(mydf),
       measure = patterns(paste0("var", c("A", "B", "C", "D"))),
       value.name = paste0("var", c("A", "B", "C", "D")))[
         , variable := factor(variable, labels = c("a", "b", "c"))][]
}

f3 <- function(mydf) {
  merged.stack(mydf, var.stubs = paste0("var", c("A", "B", "C", "D")), sep = "var.stubs")
}

## Won't run with "mydf2". Should run with "mydf"
f4 <- function(mydf) {
  reshape(mydf, direction = "long", 
          varying = lapply(c("varA", "varB", "varC", "varD"), 
                           function(x) grep(x, names(mydf))), 
          sep = "", v.names = paste0("var", c("A", "B", "C", "D")), 
          timevar="time", times = c("a", "b", "c"))
}

Test performance:

library(microbenchmark)
microbenchmark(f1(mydf), f2(mydf), f3(mydf), f4(mydf))
# Unit: milliseconds
#      expr        min         lq       mean     median         uq       max neval
#  f1(mydf) 463.006547 492.073086 528.533319 514.189548 538.910756 867.93356   100
#  f2(mydf)   3.737321   4.108376   6.674066   4.332391   4.761681  47.71142   100
#  f3(mydf)  60.211254  64.766770  86.812077  87.040087  92.841747 262.89409   100
#  f4(mydf)  40.596455  43.753431  61.006337  48.963145  69.983623 230.48449   100

Observations:

  • Base R's reshape would not be able to handle reshaping "mydf2".
  • The "dplyr" + "tidyr" approach would mangle the results in the resulting "varB", "varC", and "varD" because values would be coerced to character.
  • As the benchmarks show, reshape gives reasonable performance.

Note: Because of the time that has passed since I posted my last answer, and because the approach here is different, I thought I would share this as a new answer.



Answer 3:

merged.stack from my "splitstackshape" package handles this by utilizing the sep = "var.stubs" construct:

library(splitstackshape)
merged.stack(dadmom, var.stubs = c("inc", "name"), sep = "var.stubs")
#    famid .time_1   inc name
# 1:     1       d 30000 Bill
# 2:     1       m 15000 Bess
# 3:     2       d 22000  Art
# 4:     2       m 18000  Amy
# 5:     3       d 25000 Paul
# 6:     3       m 50000  Pat

Notice that since there is no real separator in the variables that are being stacked, we can just strip out the var.stubs from the names to create the "time" variables. Using sep = "var.stubs" is equivalent to doing sep = "inc|name".

This works because ".time_1" is created from whatever is left of the column names after the "var.stubs" have been removed.
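The idea can be sketched in base R as follows. This only illustrates the stripping step described above and is not the package's actual internal code; the cols and stubs vectors are typed in by hand.

```r
# Column names being stacked, and the stubs they share
cols  <- c("named", "incd", "namem", "incm")
stubs <- c("inc", "name")

# Remove every stub from the names; whatever remains is the "time" value
times <- gsub(paste(stubs, collapse = "|"), "", cols)
times
```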



Tags: r reshape