Reshaping from long to wide with some missing data

2019-02-20 18:41发布

问题:

When using stats:::reshape() from base to convert data from long to wide format, for any variables designated as time invariant, reshape just takes the first observed value, and if the variable is actually varying in some way, outputs a warning. In my case, I have missing data on variables I would like to designate as time invariant, but since I have this data at other time points, I would like the value from those time points to be used rather than the NA which is first observed.

testdata <- data.frame(matrix(c(1,1,2,3,4,3.5,NA,6,4,1,2,1), nrow = 3))
colnames(testdata) <- c("id", "process1", "timeinvariant", "time")
# > testdata
#   id process1 timeinvariant time
# 1  1      3.0            NA    1
# 2  1      4.0             6    2
# 3  2      3.5             4    1

# Note here the missing data on the invariant process at time 1
reshaped <- reshape(testdata, v.names="process1", direction = "wide")
# > reshaped
#   id timeinvariant process1.1 process1.2
# 1  1            NA        3.0          4
# 3  2             4        3.5         NA

The NA is propogated through to the wide format, when I would rather take the value observed at time 2 (or whenever).

回答1:

I don't know how to fix the problem but one way to fix the symptom would be to push the NA values down in the order.

testdata <- testdata[order(testdata$timeinvariant),]
testdata
#  id process1 timeinvariant time
#3  2      3.5             4    1
#2  1      4.0             6    2
#1  1      3.0            NA    1
reshaped<-reshape(testdata,v.names="process1",direction="wide")
reshaped
#  id timeinvariant process1.1 process1.2
#3  2             4        3.5         NA
#2  1             6        3.0          4

A more general solution will be to make sure there is only one value in the timevariant column per id

testdata$timeinvariant <- apply(testdata,1,function(x) max(testdata[testdata$id == x[1],"timeinvariant"],na.rm=T))
testdata
#  id process1 timeinvariant time
#3  2      3.5             4    1
#2  1      4.0             6    2
#1  1      3.0             6    1

This can be repeated for any number of columns before calling the reshape function. Hope this helps



回答2:

If there's always at least one non-missing value of timeinvariant for each id, and all (non-missing) values of timeinvariant are identical for each id (since it's time-invariant), couldn't you create a new column that fills in the NA values in timeinvariant and then reshape using that column? For example:

# Add another row to your data frame so that we'll have 2 NA values to deal with
td <- data.frame(matrix(c(1,1,2,1,3,4,3.5,4.5,NA,6,4,NA,1,2,1,3), nrow = 4))
colnames(td) <- c("id", "process1", "timeinvariant", "time")

# Create new column timeinvariant2, which fills in NAs from timeinvariant,
# then reshape using that column
library(dplyr)
td.wide = td %>%
  group_by(id) %>%
  mutate(timeinvariant2=max(timeinvariant, na.rm=TRUE)) %>%
  dcast(id + timeinvariant2 ~ time, value.var='process1')

# Paste "process1." onto names of all "time" columns
names(td.wide) = gsub("(^[0-9]*$)", "process1\\.\\1", names(td.wide) )

td.wide

  id timeinvariant2 process1.1 process1.2 process1.3
1  1              6        3.0          4        4.5
2  2              4        3.5         NA         NA


标签: r reshape