This question already has an answer here:
I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns—that is, Q3.2 in the survey will have columns Q3.2.1.
, Q3.2.2.
, and Q3.2.3.
:
df <- data.frame(
id = 1:10,
time = as.Date('2009-01-01') + 0:9,
Q3.2.1. = rnorm(10, 0, 1),
Q3.2.2. = rnorm(10, 0, 1),
Q3.2.3. = rnorm(10, 0, 1),
Q3.3.1. = rnorm(10, 0, 1),
Q3.3.2. = rnorm(10, 0, 1),
Q3.3.3. = rnorm(10, 0, 1)
)
# Sample data
id time Q3.2.1. Q3.2.2. Q3.2.3. Q3.3.1. Q3.3.2. Q3.3.3.
1 1 2009-01-01 -0.2059165 -0.29177677 -0.7107192 1.52718069 -0.4484351 -1.21550600
2 2 2009-01-02 -0.1981136 -1.19813815 1.1750200 -0.40380049 -1.8376094 1.03588482
3 3 2009-01-03 0.3514795 -0.27425539 1.1171712 -1.02641801 -2.0646661 -0.35353058
...
I want to combine all the QN.N* columns into tidy individual QN.N columns, ultimately ending up with something like this:
id time loop_number Q3.2 Q3.3
1 1 2009-01-01 1 -0.20591649 1.52718069
2 2 2009-01-02 1 -0.19811357 -0.40380049
3 3 2009-01-03 1 0.35147949 -1.02641801
...
11 1 2009-01-01 2 -0.29177677 -0.4484351
12 2 2009-01-02 2 -1.19813815 -1.8376094
13 3 2009-01-03 2 -0.27425539 -2.0646661
...
21 1 2009-01-01 3 -0.71071921 -1.21550600
22 2 2009-01-02 3 1.17501999 1.03588482
23 3 2009-01-03 3 1.11717121 -0.35353058
...
The tidyr
library has the gather()
function, which works great for combining one set of columns:
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>%
mutate(loop_number = str_sub(loop_number,-2,-2)) %>%
select(id, time, loop_number, Q3.2)
id time loop_number Q3.2
1 1 2009-01-01 1 -0.20591649
2 2 2009-01-02 1 -0.19811357
3 3 2009-01-03 1 0.35147949
...
29 9 2009-01-09 3 -0.58581232
30 10 2009-01-10 3 -2.33393981
The resultant data frame has 30 rows, as expected (10 individuals, 3 loops each). However, gathering a second set of columns does not work correctly—it successfully makes the two combined columns Q3.2
and Q3.3
, but ends up with 90 rows instead of 30 (all combinations of 10 individuals, 3 loops of Q3.2, and 3 loops of Q3.3; the combinations will increase substantially for each group of columns in the actual data):
df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>%
gather(loop_number, Q3.3, starts_with("Q3.3")) %>%
mutate(loop_number = str_sub(loop_number,-2,-2))
id time loop_number Q3.2 Q3.3
1 1 2009-01-01 1 -0.20591649 1.52718069
2 2 2009-01-02 1 -0.19811357 -0.40380049
3 3 2009-01-03 1 0.35147949 -1.02641801
...
89 9 2009-01-09 3 -0.58581232 -0.13187024
90 10 2009-01-10 3 -2.33393981 -0.48502131
Is there a way to use multiple calls to gather()
like this, combining small subsets of columns like this while maintaining the correct number of rows?
It's not at all related to "tidyr" and "dplyr", but here's another option to consider:
merged.stack
from my "splitstackshape" package, V1.4.0 and above.In case you are like me, and cannot work out how to use "regular expression with capturing groups" for
extract
, the following code replicates theextract(...)
line in Hadleys' answer:The problem here is that the initial gather forms a key column that is actually a combination of two keys. I chose to use
mutate
in my original solution in the comments to split this column into two columns with equivalent info, aloop_number
column and aquestion_number
column.spread
can then be used to transform the long form data, which are key value pairs(question_number, value)
to wide form data.This approach seems pretty natural to me:
First gather all question columns, use
extract()
to separate intoquestion
andloop_number
, thenspread()
question back into the columns.This could be done using
reshape
. It is possible withdplyr
though.Or using
dplyr
With the recent update to
melt.data.table
, we can now melt multiple columns. With that, we can do:You can get the development version from here.