Inspired by a comment from @gsk3 on a question about reshaping data, I started doing a little bit of experimentation with reshaping data where the variable names have character suffixes instead of numeric suffixes.
As an example, I'll load the dadmomw
dataset from one of the UCLA ATS Stata learning webpages (see "Example 4" on the webpage).
Here's what the dataset looks like:
library(foreign)
dadmom <- read.dta("https://stats.idre.ucla.edu/stat/stata/modules/dadmomw.dat")
dadmom
# famid named incd namem incm
# 1 1 Bill 30000 Bess 15000
# 2 2 Art 22000 Amy 18000
# 3 3 Paul 25000 Pat 50000
When trying to reshape from this wide format to long, I run into a problem. Here's what I do to reshape the data.
reshape(dadmom, direction="long", idvar=1, varying=2:5,
sep="", v.names=c("name", "inc"), timevar="dadmom",
times=c("d", "m"))
# famid dadmom name inc
# 1.d 1 d 30000 Bill
# 2.d 2 d 22000 Art
# 3.d 3 d 25000 Paul
# 1.m 1 m 15000 Bess
# 2.m 2 m 18000 Amy
# 3.m 3 m 50000 Pat
Note the swapped column names for "name" and "inc"; changing v.names
to c("inc", "name")
doesn't solve the problem.
reshape
seems very picky about wanting the columns to be named in a fairly standard way. For example, I can reshape the data correctly (and easily) if I first rename the columns:
dadmom2 <- dadmom # Just so we can continue experimenting with the original data
# Change the names of the last four variables to include a "."
names(dadmom2)[2:5] <- gsub("(d$|m$)", "\\.\\1", names(dadmom2)[2:5])
reshape(dadmom2, direction="long", idvar=1, varying=2:5,
timevar="dadmom")
# famid dadmom name inc
# 1.d 1 d Bill 30000
# 2.d 2 d Art 22000
# 3.d 3 d Paul 25000
# 1.m 1 m Bess 15000
# 2.m 2 m Amy 18000
# 3.m 3 m Pat 50000
My questions are:
- Why is R swapping the columns in the example I've provided?
- Can I get to this result with base R
reshape
without changing the variable names before reshaping? - Are there other approaches that could be considered instead of
reshape
?
This works (to specify to varying what columns go with who):
So you actually have nested repeated measures here; both name and inc for mom and dad. Because you have more than one series of repeated measures you have to supply a
list
to varying that tellsreshape
which group gets stacked on the other group.So the two approaches to this problem are to provide a list as I did or to rename the columns the way the R beast likes them as you did.
See my recent blogs on base
reshape
for more on this (particularly the second link deals with this):reshape (part I)
reshape (part II)
Though this question was specifically about base R, it is useful to know other approaches that help you to achieve the same type of outcome.
One alternative to
reshape
ormerged.stack
would be to use a combination of "dplyr" and "tidry", like this:Another alternative would be to use
melt
from "data.table", like this:How do these approaches compare with
merged.stack
?melt
is blazing fast.reshape
) probably because of having to make the data long, then wide, then performing type conversion. However, some users like its step-by-step approach.merged.stack
. Just look at the code required to get the result ;-)merged.stack
, however, can probably benefit from a simplified update, something along the lines of this functionWhich can then be used as:
How do these approaches compare with base R's
reshape
?reshape
is not able to handle unbalanced panel datasets. See, for example, "mydf2" as opposed to "mydf" in the tests below.Test cases
Here's some sample data. "mydf" is balanced. "mydf2" is not balanced.
Here are some functions to test:
Test performance:
Observations:
reshape
would not be able to handle reshaping "mydf2".reshape
gives reasonable performance.Note: Because of the difference in time between posting my last answer and the differences in approach, I thought I would share this as a new answer.
merged.stack
from my "splitstackshape" handles this by utilizing thesep = "var.stubs"
construct:Notice that since there is no real separator in the variables that are being stacked, we can just strip out the
var.stubs
from the names to create the "time" variables. Usingsep = "var.stubs"
is equivalent to doingsep = "inc|name"
.This works because ".time_1" is created by stripping out what is left after removing the "var.stubs" from the column names.