Single row per id to multiple rows per id

Published 2019-02-20 02:25

Question:

I'd like to expand observations from a single row per id to multiple rows per id, based on a given time interval:

> dput(df)
structure(list(id = c(123, 456, 789), gender = c(0, 1, 1), yr.start = c(2005, 
2010, 2000), yr.last = c(2007, 2012, 2000)), .Names = c("id", 
"gender", "yr.start", "yr.last"), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -3L))
> df
# A tibble: 3 x 4
     id gender yr.start yr.last
  <dbl>  <dbl>    <dbl>   <dbl>
1   123      0     2005    2007
2   456      1     2010    2012
3   789      1     2000    2000

I want each id expanded into one row per year:

> dput(df_out)
structure(list(id = c(123, 123, 123, 456, 456, 456, 789), gender = c(0, 
0, 0, 1, 1, 1, 1), yr = c(2005, 2006, 2007, 2010, 2011, 2012, 
2000)), .Names = c("id", "gender", "yr"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -7L))
> df_out
# A tibble: 7 x 3
     id gender    yr
  <dbl>  <dbl> <dbl>
1   123      0  2005
2   123      0  2006
3   123      0  2007
4   456      1  2010
5   456      1  2011
6   456      1  2012
7   789      1  2000

I know how to melt/reshape, but I'm not sure how I can expand the years. Thanks.

Answer 1:

Here is a base R method.

# expand years to a list
yearList <- mapply(":", df$yr.start, df$yr.last)

Now, use this list to calculate the number of times each row must be repeated (the second argument of rep) and then append the years as a vector (flattened from the list with unlist) using cbind.

# get data.frame
cbind(df[rep(seq_along(df$id), lengths(yearList)), c("id", "gender")], yr=unlist(yearList))
     id gender   yr
1   123      0 2005
1.1 123      0 2006
1.2 123      0 2007
2   456      1 2010
2.1 456      1 2011
2.2 456      1 2012
3   789      1 2000
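
If you are on R 4.0.0 or newer, sequence() accepts from and by vectors, which avoids the intermediate list entirely. A minimal sketch, assuming only base R:

# run lengths of the per-id year spans drive both the row repetition and the years
n <- df$yr.last - df$yr.start + 1
cbind(df[rep(seq_along(df$id), n), c("id", "gender")],
      yr = sequence(n, from = df$yr.start))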


Answer 2:

You could gather into long format and then fill in the missing years via complete from tidyr.

library(dplyr)
library(tidyr)

df %>%
     gather(group, yr, starts_with("yr") ) %>%
     group_by(id, gender) %>%
     complete(yr = full_seq(yr, period = 1) )

You can use select to get rid of the extra group column.

df %>%
     gather(group, yr, starts_with("yr") ) %>%
     select(-group) %>%
     group_by(id, gender) %>%
     complete(yr = full_seq(yr, period = 1) )

# A tibble: 8 x 3
# Groups:   id, gender [3]
     id gender    yr
  <dbl>  <dbl> <dbl>
1   123      0  2005
2   123      0  2006
3   123      0  2007
4   456      1  2010
5   456      1  2011
6   456      1  2012
7   789      1  2000
8   789      1  2000
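
Note that this output has eight rows: id 789 appears twice because its yr.start equals its yr.last, so the two gathered rows map to the same year. If one row per id-year is needed, appending distinct() is one possible fix; a sketch of the same pipeline:

df %>%
     gather(group, yr, starts_with("yr") ) %>%
     select(-group) %>%
     group_by(id, gender) %>%
     complete(yr = full_seq(yr, period = 1) ) %>%
     distinct()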


Answer 3:

Here is a tidyverse solution

library(tidyverse)
df %>%
  group_by(id, gender) %>%
  nest() %>%
  mutate(data = map(data, ~ seq(.x$yr.start, .x$yr.last))) %>%
  unnest() %>%
  rename(year = data)

# A tibble: 7 x 3
     id gender  year
  <dbl>  <dbl> <int>
1   123      0  2005
2   123      0  2006
3   123      0  2007
4   456      1  2010
5   456      1  2011
6   456      1  2012
7   789      1  2000
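
On tidyr 1.0.0 or later, unnest() expects its columns to be named, and the nest()/map() round trip can be skipped. A hedged sketch of an equivalent pipeline, building the year sequences directly in a list-column:

df %>%
  mutate(yr = map2(yr.start, yr.last, seq)) %>%
  unnest(yr) %>%
  select(id, gender, yr)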


Answer 4:

As the OP mentions that their production data set has more than 1M rows and that they are benchmarking the different solutions, it might be worthwhile to try a data.table version:

library(data.table)   # CRAN version 1.10.4 used
data.table(DF)[, .(yr = yr.start:yr.last), by = .(id, gender)]

which returns

    id gender    yr
1: 123      0  2005
2: 123      0  2006
3: 123      0  2007
4: 456      1  2010
5: 456      1  2011
6: 456      1  2012
7: 789      1  2000

If there are more non-varying columns than just gender, it might be more efficient to do a join rather than including all those columns in the grouping parameter by =:

data.table(DF)[DF[, .(yr = yr.start:yr.last), by = id], on = "id"]
    id gender yr.start yr.last    yr
1: 123      0     2005    2007  2005
2: 123      0     2005    2007  2006
3: 123      0     2005    2007  2007
4: 456      1     2010    2012  2010
5: 456      1     2010    2012  2011
6: 456      1     2010    2012  2012
7: 789      1     2000    2000  2000

Note that both approaches assume that id is unique in the input data.
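
If that assumption needs verifying first, a minimal guard (assuming the input is still in DF) could be:

# fail early if id is not unique
stopifnot(anyDuplicated(DF$id) == 0L)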


Benchmarking

The OP has noted being surprised that the above data.table solution is five times slower than lmo's base R solution, apparently with the OP's production data set of more than 1M rows.

Also, the question has attracted 5 different answers plus additional suggestions. So it's worthwhile to compare the solutions in terms of processing speed.

Data

As the production data set isn't available, and since problem size, among other factors like the structure of the data, matters for benchmarking, sample data sets are created.

# parameters
n_rows <- 1E2
yr_range <- 10L
start_yr <- seq(2000L, length.out = 10L, by = 1L)
# create sample data set
set.seed(123L)
library(data.table)
DT <- data.table(id = seq_len(n_rows),
                 gender = sample(0:1, n_rows, replace = TRUE),
                 yr.start = sample(start_yr, n_rows, replace = TRUE))
DT[, yr.last := yr.start + sample(0:yr_range, n_rows, replace = TRUE)]
DF <- as.data.frame(DT)
str(DT)
Classes ‘data.table’ and 'data.frame':    100 obs. of  4 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender  : int  0 1 0 1 1 0 1 1 1 0 ...
 $ yr.start: int  2005 2003 2004 2009 2004 2008 2009 2006 2004 2001 ...
 $ yr.last : int  2007 2013 2010 2014 2008 2017 2013 2009 2005 2002 ...
 - attr(*, ".internal.selfref")=<externalptr>

For the first run, 100 rows are created, the start year can vary between 2000 and 2009, and the span of years an individual id can cover is between 0 and 10 years. Thus, the result set is expected to have approximately 100 * (10 / 2 + 1) = 600 rows.

Also, only one additional non-varying column (gender) is included, although the OP has said that the production data may have 2 to 10 such columns.

Code

library(magrittr)
bm <- microbenchmark::microbenchmark(
  lmo = {
    yearList <- mapply(":", DF$yr.start, DF$yr.last)
    res_lmo <- cbind(DF[rep(seq_along(DF$id), lengths(yearList)), c("id", "gender")], 
                     yr=unlist(yearList))
  },
  hao = {
    res_hao <- DF %>%
      dplyr::group_by(id, gender) %>%
      tidyr::nest() %>%
      dplyr::mutate(data = purrr::map(data, ~ seq(.x$yr.start, .x$yr.last))) %>%
      tidyr::unnest() %>%
      dplyr::rename(yr = data)
  },
  aosmith = {
    res_aosmith <- DF %>%
      tidyr::gather(group, yr, dplyr::starts_with("yr") ) %>%
      dplyr::select(-group) %>%
      dplyr::group_by(id, gender) %>%
      tidyr::complete(yr = tidyr::full_seq(yr, period = 1) )
  },
  jason = {
    res_jason <- DF %>%
      dplyr::group_by(id, gender) %>%
      dplyr::do(data.frame(yr=.$yr.start:.$yr.last))
  },
  uwe1 = {
    res_uwe1 <- DT[, .(yr = yr.start:yr.last), by = .(id, gender)]
  },
  uwe2 = {
    res_uwe2 <- DT[DT[, .(yr = yr.start:yr.last), by = id], on = "id"
                   ][, c("yr.start", "yr.last") := NULL]
  },
  frank1 = {
    res_frank1 <- DT[rep(1:.N, yr.last - yr.start + 1L), 
                     .(id, gender, yr = DT[, unlist(mapply(":", yr.start, yr.last))])]
    },
  frank2 = {
    res_frank2 <- DT[, {
      m = mapply(":", yr.start, yr.last); c(.SD[rep(.I, lengths(m))], .(yr = unlist(m)))}, 
      .SDcols=id:gender]
  },
  times = 3L
)

Note that tidyverse functions are referenced with explicit namespace prefixes in order to avoid name conflicts from a cluttered namespace.

First run

Unit: microseconds
    expr        min          lq       mean     median         uq        max neval
     lmo    655.860    692.6740    968.749    729.488   1125.193   1520.899     3
     hao  40610.776  41484.1220  41950.184  42357.468  42619.887  42882.307     3
 aosmith 319715.984 336006.9255 371176.437 352297.867 396906.664 441515.461     3
   jason  77525.784  78197.8795  78697.798  78869.975  79283.804  79697.634     3
    uwe1    834.079    870.1375    894.869    906.196    925.264    944.332     3
    uwe2   1796.910   1810.8810   1880.482   1824.852   1922.268   2019.684     3
  frank1    981.712   1057.4170   1086.680   1133.122   1139.164   1145.205     3
  frank2    994.172   1003.6115   1081.016   1013.051   1124.438   1235.825     3

For the given problem size of 100 rows, the timings clearly indicate that the dplyr/tidyr solutions are orders of magnitude slower than the base R and data.table solutions.

The results are essentially consistent:

all.equal(as.data.table(res_lmo), res_uwe1)
all.equal(res_hao, res_uwe1)
all.equal(res_jason, res_uwe1)
all.equal(res_uwe2, res_uwe1)
all.equal(res_frank1, res_uwe1)
all.equal(res_frank2, res_uwe1)

All of these return TRUE except all.equal(res_aosmith, res_uwe1), which returns

[1] "Incompatible type for column yr: x numeric, y integer"

Second run

Due to the long execution times, the tidyverse solutions are skipped when benchmarking larger problem sizes.

With the modified parameters

n_rows <- 1E4
yr_range <- 100L

the result set is expected to consist of about 500,000 rows.

Unit: milliseconds
   expr        min         lq      mean    median        uq       max neval
    lmo 425.026101 447.716671 455.85324 470.40724 471.26681 472.12637     3
   uwe1   9.555455   9.796163  10.05562  10.03687  10.30571  10.57455     3
   uwe2  18.711805  18.992726  19.40454  19.27365  19.75091  20.22817     3
 frank1  22.639031  23.129131  23.58424  23.61923  24.05685  24.49447     3
 frank2  13.989016  14.124945  14.47987  14.26088  14.72530  15.18973     3

For the given problem size and structure, the data.table solutions are the fastest, while the base R approach is an order of magnitude slower. The most concise solution, uwe1, is also the fastest here.

Note that the results depend on the structure of the data, in particular the parameters n_rows and yr_range and the number of non-varying columns. If there are more of those columns than just gender, the timings might look different.
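
To probe that scenario, the benchmark data could be extended with additional non-varying columns before re-running; a sketch with made-up column names V1 to V8:

# hypothetical extra non-varying columns (names are illustrative only);
# the join approach (uwe2) carries them along automatically, while the
# grouping approach (uwe1) would need them all listed in by =
for (j in paste0("V", 1:8)) {
  DT[, (j) := sample(letters, n_rows, replace = TRUE)]
}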

The benchmark results contradict the OP's observation on execution speed, which needs further investigation.



Answer 5:

Another way is to use do in dplyr, but it's slower than the base R method.

df %>%
  group_by(id, gender) %>%
  do(data.frame(yr=.$yr.start:.$yr.last))

# # A tibble: 7 x 3
# # Groups:   id, gender [3]
#      id gender    yr
#   <dbl>  <dbl> <int>
# 1   123      0  2005
# 2   123      0  2006
# 3   123      0  2007
# 4   456      1  2010
# 5   456      1  2011
# 6   456      1  2012
# 7   789      1  2000
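
On dplyr 1.1.0 or later, do() is superseded; reframe() handles multi-row results per group and would be a more current way to write this. A sketch:

df %>%
  group_by(id, gender) %>%
  reframe(yr = yr.start:yr.last)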