Progressive concatenation of a column by a group [

2019-04-26 20:33发布

问题:

This question already has an answer here:

  • Cumulatively paste (concatenate) values grouped by another variable 4 answers

Suppose that I have this input :

             ID     date_1      date_2     str
1            1    2010-07-04  2008-01-20   A
2            2    2015-07-01  2011-08-31   C
3            3    2015-03-06  2013-01-18   D
4            4    2013-01-10  2011-08-30   D
5            5    2014-06-04  2011-09-18   B
6            5    2014-06-04  2011-09-18   B
7            6    2012-11-22  2011-09-28   C
8            7    2014-06-17  2013-08-04   A
10           7    2014-06-17  2013-08-04   B
11           7    2014-06-17  2013-08-04   B

I would like to progressively concatenate the values of the str column by the group variable ID, as showed in the following output :

             ID     date_1      date_2     str
1            1    2010-07-04  2008-01-20   A
2            2    2015-07-01  2011-08-31   C
3            3    2015-03-06  2013-01-18   D
4            4    2013-01-10  2011-08-30   D
5            5    2014-06-04  2011-09-18   B
6            5    2014-06-04  2011-09-18   B,B
7            6    2012-11-22  2011-09-28   C
8            7    2014-06-17  2013-08-04   A
10           7    2014-06-17  2013-08-04   A,B
11           7    2014-06-17  2013-08-04   A,B,B

I tried to use the ave() function with this code :

within(table, {
  Emp_list <- ave(str, ID, FUN = function(x) paste(x, collapse = ","))
})

but it gives the following output, which is not exactly what I want :

         ID      date_1     date_2      str
1         1    2010-07-04 2008-01-20     A
2         2    2015-07-01 2011-08-31     C
3         3    2015-03-06 2013-01-18     D
4         4    2013-01-10 2011-08-30     D
5         5    2014-06-04 2011-09-18     B,B
6         5    2014-06-04 2011-09-18     B,B
7         6    2012-11-22 2011-09-28     C
8         7    2014-06-17 2013-08-04     A,B,B
10        7    2014-06-17 2013-08-04     A,B,B
11        7    2014-06-17 2013-08-04     A,B,B

Of course I'd like to avoid loops, as I work on a large database.

回答1:

How about ave() with Reduce(). The Reduce() function allows us to accumulate results as they are calculated. So if we run it with paste() we can accumulate the pasted strings.

f <- function(x) {
    Reduce(function(...) paste(..., sep = ", "), x, accumulate = TRUE)
}

df$str <- with(df, ave(as.character(str), ID, FUN = f)

which gives the updated data frame df

   ID     date_1     date_2     str
1   1 2010-07-04 2008-01-20       A
2   2 2015-07-01 2011-08-31       C
3   3 2015-03-06 2013-01-18       D
4   4 2013-01-10 2011-08-30       D
5   5 2014-06-04 2011-09-18       B
6   5 2014-06-04 2011-09-18    B, B
7   6 2012-11-22 2011-09-28       C
8   7 2014-06-17 2013-08-04       A
10  7 2014-06-17 2013-08-04    A, B
11  7 2014-06-17 2013-08-04 A, B, B

Note: function(...) paste(..., sep = ", ") could also be function(x, y) paste(x, y, sep = ", "). (Thanks Pierre Lafortune)



回答2:

Here's a possible solution combining data.table with an inner tapply that seem to get you what you need (you can use paste instead of toString if you like, it just looks cleaner to me that way).

library(data.table)
setDT(df)[, Str := tapply(str[sequence(1:.N)], rep(1:.N, 1:.N), toString), by = ID]
df
#     ID     date_1     date_2 str     Str
#  1:  1 2010-07-04 2008-01-20   A       A
#  2:  2 2015-07-01 2011-08-31   C       C
#  3:  3 2015-03-06 2013-01-18   D       D
#  4:  4 2013-01-10 2011-08-30   D       D
#  5:  5 2014-06-04 2011-09-18   B       B
#  6:  5 2014-06-04 2011-09-18   B    B, B
#  7:  6 2012-11-22 2011-09-28   C       C
#  8:  7 2014-06-17 2013-08-04   A       A
#  9:  7 2014-06-17 2013-08-04   B    A, B
# 10:  7 2014-06-17 2013-08-04   B A, B, B

You may be able to improve it a bit using

setDT(df)[, Str := {Len <- 1:.N ; tapply(str[sequence(Len)], rep(Len, Len), toString)}, by = ID]