I'd like to expand observations from single row-per-id to multiple rows-per-id based on a given time interval:
> dput(df)
structure(list(id = c(123, 456, 789), gender = c(0, 1, 1), yr.start = c(2005,
2010, 2000), yr.last = c(2007, 2012, 2000)), .Names = c("id",
"gender", "yr.start", "yr.last"), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -3L))
> df
# A tibble: 3 x 4
     id gender yr.start yr.last
  <dbl>  <dbl>    <dbl>   <dbl>
1   123      0     2005    2007
2   456      1     2010    2012
3   789      1     2000    2000
I want to get id expanded into one row per year:
> dput(df_out)
structure(list(id = c(123, 123, 123, 456, 456, 456, 789), gender = c(0,
0, 0, 1, 1, 1, 1), yr = c(2005, 2006, 2007, 2010, 2011, 2012,
2000)), .Names = c("id", "gender", "yr"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -7L))
> df_out
# A tibble: 7 x 3
     id gender    yr
  <dbl>  <dbl> <dbl>
1   123      0  2005
2   123      0  2006
3   123      0  2007
4   456      1  2010
5   456      1  2011
6   456      1  2012
7   789      1  2000
I know how to melt/reshape, but I'm not sure how I can expand the years. Thanks.
As the OP mentions, his production data set has more than 1 M rows and he is benchmarking the different solutions, so it might be worthwhile to try a data.table version.
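A minimal sketch of such a data.table version (a reconstruction, not necessarily the original answer's exact code) which, for the example data, returns the seven rows of df_out shown above:

library(data.table)
# one row per year between yr.start and yr.last;
# the non-varying columns (here only gender) go into by =
setDT(df)[, .(yr = seq(yr.start, yr.last)), by = .(id, gender)]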
If there are more non-varying columns than just gender, it might be more efficient to do a join rather than including all those columns in the grouping parameter by =:
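A sketch of this join variant (again a reconstruction; it assumes only id is needed to expand the years, with everything else joined back on afterwards):

library(data.table)
setDT(df)
# expand the id/year pairs first, then join all other columns back on
expanded <- df[, .(yr = seq(yr.start, yr.last)), by = id]
df[expanded, on = "id"][, c("yr.start", "yr.last") := NULL][]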
Note that both approaches assume that id is unique in the input data.
Benchmarking
The OP has noted that he is surprised that the above data.table solution is five times slower than lmo's base R solution, apparently with the OP's production data set of more than 1 M rows. Also, the question has attracted 5 different answers plus additional suggestions, so it's worthwhile to compare the solutions in terms of processing speed.
Data
As the production data set isn't available, and problem size, among other factors like the structure of the data, is important for benchmarking, sample data sets are created.
For the first run, 100 rows are created, the start year can vary between 2000 and 2009, and the span of years an individual id can cover is between 0 and 10 years. Thus, the result set should be expected to have approximately 100 * (10 + 1) / 2 rows. Also, only one additional column, gender, is included, although the OP has said that the production data may have 2 to 10 non-varying columns.
Code
Note that references to tidyverse functions are explicit in order to avoid name conflicts due to a cluttered namespace.
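A sketch of what the data creation and a microbenchmark call could look like; the parameter names n_rows and yr_range are taken from the text below, while the seed, the exact tibble construction, and the selection of benchmarked solutions are assumptions:

# parameters for the first run (values as described above)
n_rows   <- 100L
yr_range <- 10L                      # maximum span of years per id
set.seed(1234L)                      # assumed; any fixed seed works

df <- tibble::tibble(
  id       = seq_len(n_rows),
  gender   = sample(0:1, n_rows, replace = TRUE),
  yr.start = sample(2000:2009, n_rows, replace = TRUE),
  yr.last  = yr.start + sample(0:yr_range, n_rows, replace = TRUE)
)

microbenchmark::microbenchmark(
  uwe1 = {
    res_uwe1 <- data.table::as.data.table(df)[
      , .(yr = seq(yr.start, yr.last)), by = .(id, gender)]
  },
  lmo = {
    yrs <- Map(seq, df$yr.start, df$yr.last)
    res_lmo <- cbind(df[rep(seq_len(nrow(df)), lengths(yrs)), c("id", "gender")],
                     yr = unlist(yrs))
  },
  times = 100L
)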
First run
For the given problem size of 100 rows, the timings clearly indicate that the dplyr/tidyr solutions are magnitudes slower than the base R or data.table solutions.
The results are essentially consistent: the all.equal() comparisons return TRUE, except for all.equal(res_aosmith, res_uwe1), which does not.
Second run
Due to the long execution times, the tidyverse solutions are skipped when benchmarking larger problem sizes. With the modified parameters, the result set is expected to consist of about 500'000 rows.
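One assumed choice of modified parameters that would produce a result of roughly that size:

# assumed values for the second run: roughly 5.5 expanded rows per id
# on average, so 1e5 ids yield on the order of 500'000 result rows
n_rows   <- 1e5L
yr_range <- 10L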
For the given problem size and structure, the data.table solutions are the fastest, while the base R approach is a magnitude slower. The most concise solution, uwe1, is also the fastest here.
Note that the results depend on the structure of the data, in particular on the parameters n_rows and yr_range and on the number of non-varying columns. If there are more of those columns than just gender, the timings might look different.
The benchmark results contradict the OP's observation on execution speed, which needs to be further investigated.
Another way is to use do in dplyr, but it's slower than the base R method.
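A minimal sketch of such a do()-based approach (a reconstruction, assuming the expansion is done per group):

library(dplyr)
df %>%
  group_by(id, gender) %>%
  do(data.frame(yr = seq(.$yr.start, .$yr.last))) %>%
  ungroup()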
Here is a tidyverse solution.
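One possible tidyverse version (assumed: a list column of year sequences built with purrr::map2() and expanded with tidyr::unnest()):

library(dplyr)
library(purrr)
library(tidyr)
df %>%
  mutate(yr = map2(yr.start, yr.last, seq)) %>%   # list column of year sequences
  unnest(yr) %>%                                  # one row per year
  select(-yr.start, -yr.last)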
Here is a base R method.
First, construct a list of year sequences, one element per id. Now, use this list to calculate the number of rows to repeat for each ID (the second argument of rep) and then append it as a vector (transformed from a list with unlist) using cbind.
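A sketch along the lines described above (a reconstruction; Map() and lengths() are one way to build and measure the list of year sequences):

# one year sequence per id
yrs <- Map(seq, df$yr.start, df$yr.last)
# repeat each input row once per year in its sequence, then append the years
cbind(df[rep(seq_len(nrow(df)), lengths(yrs)), c("id", "gender")],
      yr = unlist(yrs))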
You could gather into long format and then fill in the missing rows via complete using tidyr. You can use select to get rid of the extra column.
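A sketch of this gather/complete approach (a reconstruction; the final distinct() is an extra safeguard for ids whose start and last year coincide, such as id 789):

library(dplyr)
library(tidyr)
df %>%
  gather(group, yr, yr.start, yr.last) %>%       # long format: one row per interval endpoint
  group_by(id, gender) %>%
  complete(yr = full_seq(yr, period = 1)) %>%    # fill in the years in between
  ungroup() %>%
  select(-group) %>%                             # drop the extra key column
  distinct()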