Reshape/gather function to create dataset ready fo

2019-02-20 02:05发布

问题:

I have a big dataset, with 240 cases representing 240 patients. They all have undergone neuropsychological tests and filled in questionnaires. Additionally, their significant others (hereafter: proxies) have also filled in questionnaires. Since 'patient' and 'proxy' are nested in 'couples', I want to conduct a multilevel analysis in R. For this, I need to reshape my dataset to run those kind of analysis.

Simply said, I want to 'duplicate' my rows. For the double subject IDs add a new variable with 1s and 2s, where 1 stands for patient data and 2 stands for proxy data. Then I want the rows to be filled with 1. all the patient data and the columns that contain the proxy data to be NA or empty or whatever, and 2. all the proxy data, and all the patient data NA or empty.

Let's say this is my data:

id <- c(1:5)
names <- c('id', 'p1', 'p2', 'p3', 'pr1', 'pr2', 'pr3')
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))

mydf <- as.data.frame(matrix(c(id, p1, p2, p3, pr1, pr2, pr3), nrow = 5))
colnames(mydf) <- names

>mydf

   id p1 p2 p3 pr1 pr2 pr3
1  1  6  20 22 1   10  24
2  2  8  11 24 2   18  29
3  3  7  10 25 6   20  26
4  4  3  14 20 10  15  20
5  5  5  19 29 7   14  22

I want my data finally to look like this:

id2 <- rep(c(1:5), each = 2)
names2 <- c('id', 'couple', 'q1', 'q2', 'q3')
couple <- rep(1:2, 5)
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))

mydf <- as.data.frame(matrix(c(id2, couple, p1, p2, p3, pr1, pr2, pr3), nrow = 10, ncol = 5))
colnames(mydf) <- names2

>mydf
   id couple q1 q2 q3
1   1      1  6 23 16
2   1      2 10 28 10
3   2      1  1 27 14
4   2      2  7 21 20
5   3      1  5 30 18
6   3      2 12  2 27
7   4      1 10  1 25
8   4      2 13  7 21
9   5      1 11  6 20
10  5      2 18  3 23

Or, if this is not possible, like this:

   id couple bb1 bb2 bb3 pbb1 pbb2 pbb3
1   1      1  6  23  16
2   1      2             10   28   10
3   2      1  1  27  14
4   2      2             7    21   20
5   3      1  5  30  18
6   3      2             12   2    27
7   4      1 10   1  25
8   4      2             13   7    21
9   5      1 11   6  20
10  5      2             18   3    23

Now, to get me there, i've tried the melt() function and the gather() function and it feels like i'm close but still it's not working the way I want it to work.

note, in my dataset the variable names are bb1:bb54 for the patient questionnaire and pbb1:pbb54 for the proxy questionnaire

Example of what I've tried

df_long <- df_reshape %>%
gather(testname, value, -(bb1:bb11), -(pbb1:pbb11), -id, -pgebdat, -p_age, na.rm=T) %>%
arrange(id)

回答1:

One approach would be to use unite and separate in tidyr, along with the gather function as well.

I'm using your mydf data frame since it was provided, but it should be pretty straightforward to make any changes:

mydf %>% 
  unite(p1:p3, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
  unite(pr1:pr3, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
  gather(couple, value, `1`:`2`) %>% # Form into long data
  separate(value, sep = ";", into = c("q1", "q2", "q3"), convert = TRUE) %>% # Separate and retrieve original answers
  arrange(id)

Which gives you:

   id couple q1 q2 q3
1   1      1  9 18 25
2   1      2 10 18 30
3   2      1  1 11 29
4   2      2  2 15 29
5   3      1 10 19 26
6   3      2  3 19 25
7   4      1  7 10 23
8   4      2  1 20 28
9   5      1  6 16 21
10  5      2  5 12 26

Our numbers are different since they were all randomly generated with sample.


Edited per @alistaire comment: add convert = TRUE to the separate call to make sure the responses are still of class integer.



回答2:

If I understand what you want correctly, you can gather everything to a very long form and then reshape back to a slightly wider form:

library(tidyverse)
set.seed(47)    # for reproducibility

mydf <- data.frame(id = c(1:5),
                   p1 = c(sample(1:10, 5)),
                   p2 = c(sample(10:20, 5)),
                   p3 = c(sample(20:30, 5)),
                   pr1 = c(sample(1:10, 5)),
                   pr2 = c(sample(10:20, 5)),
                   pr3 = c(sample(20:30, 5)))

mydf_long <- mydf %>% 
    gather(var, val, -id) %>% 
    separate(var, c('couple', 'q'), -2) %>% 
    mutate(q = paste0('q', q)) %>% 
    spread(q, val)

mydf_long
#>    id couple q1 q2 q3
#> 1   1      p 10 17 21
#> 2   1     pr 10 11 24
#> 3   2      p  4 13 27
#> 4   2     pr  4 15 20
#> 5   3      p  7 14 30
#> 6   3     pr  1 14 29
#> 7   4      p  6 18 24
#> 8   4     pr  8 20 30
#> 9   5      p  9 16 23
#> 10  5     pr  3 18 25