I'd like to expand observations from single row-per-id to multiple rows-per-id based on a given time interval:
> dput(df)
structure(list(id = c(123, 456, 789), gender = c(0, 1, 1), yr.start = c(2005,
2010, 2000), yr.last = c(2007, 2012, 2000)), .Names = c("id",
"gender", "yr.start", "yr.last"), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -3L))
> df
# A tibble: 3 x 4
id gender yr.start yr.last
<dbl> <dbl> <dbl> <dbl>
1 123 0 2005 2007
2 456 1 2010 2012
3 789 1 2000 2000
I want to get id expanded into one row per year:
> dput(df_out)
structure(list(id = c(123, 123, 123, 456, 456, 456, 789), gender = c(0,
0, 0, 1, 1, 1, 1), yr = c(2005, 2006, 2007, 2010, 2011, 2012,
2000)), .Names = c("id", "gender", "yr"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -7L))
> df_out
# A tibble: 7 x 3
id gender yr
<dbl> <dbl> <dbl>
1 123 0 2005
2 123 0 2006
3 123 0 2007
4 456 1 2010
5 456 1 2011
6 456 1 2012
7 789 1 2000
I know how to melt/reshape, but I'm not sure how I can expand the years.
Thanks.
Here is a base R method.
# expand years to a list
yearList <- mapply(":", df$yr.start, df$yr.last)
Now, use this list to calculate the number of rows to repeat for each ID (the second argument of rep
) and then append it as a vector (transformed from list with unlist
) using cbind
.
# get data.frame
cbind(df[rep(seq_along(df$id), lengths(yearList)), c("id", "gender")], yr=unlist(yearList))
id gender yr
1 123 0 2005
1.1 123 0 2006
1.2 123 0 2007
2 456 1 2010
2.1 456 1 2011
2.2 456 1 2012
3 789 1 2000
You could gather
into long format and then fill in the missing rows via complete
using tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(group, yr, starts_with("yr") ) %>%
group_by(id, gender) %>%
complete(yr = full_seq(yr, period = 1) )
You can use select
to get rid of the extra column.
df %>%
gather(group, yr, starts_with("yr") ) %>%
select(-group) %>%
group_by(id, gender) %>%
complete(yr = full_seq(yr, period = 1) )
# A tibble: 8 x 3
# Groups: id, gender [3]
id gender yr
<dbl> <dbl> <dbl>
1 123 0 2005
2 123 0 2006
3 123 0 2007
4 456 1 2010
5 456 1 2011
6 456 1 2012
7 789 1 2000
8 789 1 2000
As the OP mentions that his production data set has more than 1 M rows and he is benchmarking the different solutions, it might be worthwhile to try a data.table
version:
library(data.table) # CRAN version 1.10.4 used
data.table(DF)[, .(yr = yr.start:yr.last), by = .(id, gender)]
which returns
id gender yr
1: 123 0 2005
2: 123 0 2006
3: 123 0 2007
4: 456 1 2010
5: 456 1 2011
6: 456 1 2012
7: 789 1 2000
If there are more non-varying columns than just gender
it might be more efficient to do a join rather than including all those columns in the grouping parameter by =
:
data.table(DF)[DF[, .(yr = yr.start:yr.last), by = id], on = "id"]
id gender yr.start yr.last yr
1: 123 0 2005 2007 2005
2: 123 0 2005 2007 2006
3: 123 0 2005 2007 2007
4: 456 1 2010 2012 2010
5: 456 1 2010 2012 2011
6: 456 1 2010 2012 2012
7: 789 1 2000 2000 2000
Note that both approaches assume that id
is unique in the input data.
Benchmarking
The OP has noted that he is surprised that above data.table
solution is five times slower than lmo's base R solution, apparently with OP's production data set of more than 1 M rows.
Also, the question has attracted 5 different answers plus additional suggestions. So, it's worthwhile to compare the solution in terms of processing speed.
Data
As the production data set isn't available, and problem size among other factors like the strcuture of the data is important for benchmarking, sample data sets are created.
# parameters
n_rows <- 1E2
yr_range <- 10L
start_yr <- seq(2000L, length.out = 10L, by = 1L)
# create sample data set
set.seed(123L)
library(data.table)
DT <- data.table(id = seq_len(n_rows),
gender = sample(0:1, n_rows, replace = TRUE),
yr.start = sample(start_yr, n_rows, replace = TRUE))
DT[, yr.last := yr.start + sample(0:yr_range, n_rows, replace = TRUE)]
DF <- as.data.frame(DT)
str(DT)
Classes ‘data.table’ and 'data.frame': 100 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : int 0 1 0 1 1 0 1 1 1 0 ...
$ yr.start: int 2005 2003 2004 2009 2004 2008 2009 2006 2004 2001 ...
$ yr.last : int 2007 2013 2010 2014 2008 2017 2013 2009 2005 2002 ...
- attr(*, ".internal.selfref")=<externalptr>
For the first run, 100 rows are created, the start year can vary between 2000 and 2009, and the span of years an indivdual id
can cover is between 0 and 10 years. Thus, the result set should be expected to have approximately 100 * (10 + 1) / 2 rows.
Also, only one additional column gender
is included although the OP has told that the producion data may have 2 to 10 non-varying columns.
Code
library(magrittr)
bm <- microbenchmark::microbenchmark(
lmo = {
yearList <- mapply(":", DF$yr.start, DF$yr.last)
res_lmo <- cbind(DF[rep(seq_along(DF$id), lengths(yearList)), c("id", "gender")],
yr=unlist(yearList))
},
hao = {
res_hao <- DF %>%
dplyr::group_by(id, gender) %>%
tidyr::nest() %>%
dplyr::mutate(data = purrr::map(data, ~ seq(.x$yr.start, .x$yr.last))) %>%
tidyr::unnest() %>%
dplyr::rename(yr = data)
},
aosmith = {
res_aosmith <- DF %>%
tidyr::gather(group, yr, dplyr::starts_with("yr") ) %>%
dplyr::select(-group) %>%
dplyr::group_by(id, gender) %>%
tidyr::complete(yr = tidyr::full_seq(yr, period = 1) )
},
jason = {
res_jason <- DF %>%
dplyr::group_by(id, gender) %>%
dplyr::do(data.frame(yr=.$yr.start:.$yr.last))
},
uwe1 = {
res_uwe1 <- DT[, .(yr = yr.start:yr.last), by = .(id, gender)]
},
uwe2 = {
res_uwe2 <- DT[DT[, .(yr = yr.start:yr.last), by = id], on = "id"
][, c("yr.start", "yr.last") := NULL]
},
frank1 = {
res_frank1 <- DT[rep(1:.N, yr.last - yr.start + 1L),
.(id, gender, yr = DT[, unlist(mapply(":", yr.start, yr.last))])]
},
frank2 = {
res_frank2 <- DT[, {
m = mapply(":", yr.start, yr.last); c(.SD[rep(.I, lengths(m))], .(yr = unlist(m)))},
.SDcols=id:gender]
},
times = 3L
)
Note that references to tidyverse functions are explicit in order to avoid name conflicts due to a cluttered name space.
First run
Unit: microseconds
expr min lq mean median uq max neval
lmo 655.860 692.6740 968.749 729.488 1125.193 1520.899 3
hao 40610.776 41484.1220 41950.184 42357.468 42619.887 42882.307 3
aosmith 319715.984 336006.9255 371176.437 352297.867 396906.664 441515.461 3
jason 77525.784 78197.8795 78697.798 78869.975 79283.804 79697.634 3
uwe1 834.079 870.1375 894.869 906.196 925.264 944.332 3
uwe2 1796.910 1810.8810 1880.482 1824.852 1922.268 2019.684 3
frank1 981.712 1057.4170 1086.680 1133.122 1139.164 1145.205 3
frank2 994.172 1003.6115 1081.016 1013.051 1124.438 1235.825 3
For the given problem size of 100 rows, the timings clearly indicate that the dplyr
/ tidyr
solutions are magnitudes slower than base R or data.table
solutions.
The results are essentially consistent:
all.equal(as.data.table(res_lmo), res_uwe1)
all.equal(res_hao, res_uwe1)
all.equal(res_jason, res_uwe1)
all.equal(res_uwe2, res_uwe1)
all.equal(res_frank1, res_uwe1)
all.equal(res_frank2, res_uwe1)
return TRUE
except all.equal(res_aosmith, res_uwe1)
which returns
[1] "Incompatible type for column yr: x numeric, y integer"
Second run
Due to the long execution times, the tidyverse
solutions are skipped when benchmarking larger problem sizes.
With the modified parameters
n_rows <- 1E4
yr_range <- 100L
the result set is expected to consist of about 500'000 rows.
Unit: milliseconds
expr min lq mean median uq max neval
lmo 425.026101 447.716671 455.85324 470.40724 471.26681 472.12637 3
uwe1 9.555455 9.796163 10.05562 10.03687 10.30571 10.57455 3
uwe2 18.711805 18.992726 19.40454 19.27365 19.75091 20.22817 3
frank1 22.639031 23.129131 23.58424 23.61923 24.05685 24.49447 3
frank2 13.989016 14.124945 14.47987 14.26088 14.72530 15.18973 3
For the given problem size and structure the data.table
solutions are the fastest while the base R approach is a magnitude slower. The most concise solution uwe1
is also the fastest, here.
Note that the results depend on the structure of the data, in particular the parameters n_rows
and yr_range
and the number of non-varying columns. If there are more of those columns than just gender
the timings might look differently.
The benchmark results are in contradiction to the OP's observation on execution speed which needs to be further investigated.