Make new column which adds 30 days to date on ever

Posted 2019-08-07 10:15

Question:

I have a df with two variables, names and dates. I would like to create a new column (new_dates) which starts from the first date belonging to each person (each person has just one date, repeated, in the dates column) and adds 30 days for each successive row.

Desired output is below. So row1 for each person holds the original date, row2 holds row1+30, row3 holds row2+30 and so on.

dff
   names      dates  new_dates
1   john 2010-06-01 2010-06-01
2   john 2010-06-01 2010-07-01
3   john 2010-06-01 2010-07-31
4   john 2010-06-01 2010-08-30
5   mary 2010-07-09 2010-07-09
6   mary 2010-07-09 2010-08-08
7   mary 2010-07-09 2010-09-07
8   mary 2010-07-09 2010-10-07
9    tom 2010-06-01 2010-06-01
10   tom 2010-06-01 2010-07-01
11   tom 2010-06-01 2010-07-31
12   tom 2010-06-01 2010-08-30

I thought I could use transform for this. Here is my attempt at it - but it doesn't quite work for me.

dt <- transform(df, new_date = c(dates[2]+30, NA))

Answer 1:

data.table makes this easy. Once you convert to a data table, it's basically one command. The main problem with your version is that you need to split the data by name first, so that you can get the minimum date for each person and then add the appropriate multiple of 30 days to it.

library(data.table)
df$dates <- as.Date(df$dates)
dt <- as.data.table(df)
dt[, 
   list(dates, new_dates=min(dates) + 0:(length(dates) - 1L) * 30), 
   by=names
]
#     names      dates  new_dates
#  1:  john 2010-06-01 2010-06-01
#  2:  john 2010-06-01 2010-07-01
#  3:  john 2010-06-01 2010-07-31
#  4:  john 2010-06-01 2010-08-30
#  5:  mary 2010-07-09 2010-07-09
#  6:  mary 2010-07-09 2010-08-08
#  7:  mary 2010-07-09 2010-09-07
#  8:  mary 2010-07-09 2010-10-07
#  9:   tom 2010-06-01 2010-06-01
# 10:   tom 2010-06-01 2010-07-01
# 11:   tom 2010-06-01 2010-07-31
# 12:   tom 2010-06-01 2010-08-30
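
If the goal is to add the column to the existing table rather than print a new one, a variant of the call above might use data.table's := operator together with the .N group-size symbol. This is only a minimal sketch: the sample data frame is an assumption, rebuilt from the desired output in the question.

```r
library(data.table)

# Sample data rebuilt from the question (assumption: 4 rows per person)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), each = 4),
  dates = as.Date(rep(c("2010-06-01", "2010-07-09", "2010-06-01"), each = 4))
)
dt <- as.data.table(df)

# := adds the column by reference; .N is the number of rows in each group
dt[, new_dates := min(dates) + 30L * (seq_len(.N) - 1L), by = names]
print(dt)
```

Because := modifies dt in place, there is no need to assign the result back to dt.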

EDIT: here is a version that hopefully shows why yours didn't work. I still prefer data.table, but hopefully since this is basically very close to what you were doing it makes it clear what you need to change:

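re_date <- function(df) {
  # sort one person's rows by date, then offset from the earliest date
  transform(
    df[order(df$dates), ],
    new_dates = min(dates) + 30 * 0:(length(dates) - 1L)
  )
}
# split by person, apply re_date to each piece, then bind the pieces together
do.call(rbind, lapply(split(df, df$names), re_date))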

Starting with the bottom line (do.call...), the split call makes a list with three data frames, one with the values for John, one for those for Mary, and one for those for Tom. The lapply then runs each of those data frames through the re_date function, which adds the new_dates column, and finally, the do.call/rbind stitches it back together into one data frame.
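
The same split, apply and combine steps can also be sketched with base R's ave(), which applies a function within each group and returns the results in the original row order. This swaps in a different technique from the one above, and the sample data frame is an assumption rebuilt from the question. The dates are converted to day numbers first, since ave() is not guaranteed to keep the Date class.

```r
# Sample data rebuilt from the question (assumption: 4 rows per person)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), each = 4),
  dates = as.Date(rep(c("2010-06-01", "2010-07-09", "2010-06-01"), each = 4))
)

# Within each person: earliest date (as a day number) plus 0, 30, 60, ... days
day_numbers <- ave(
  as.numeric(df$dates), df$names,
  FUN = function(d) min(d) + 30 * (seq_along(d) - 1)
)

# Convert the day numbers back to Date
df$new_dates <- as.Date(day_numbers, origin = "1970-01-01")
print(df)
```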



Answer 2:

Sorry, I read the question too quickly and didn't realize what you were doing at first.

This is definitely a brute-force method, and my programming is not, how you say, elegant, but it seems to give the desired result:

df <- psych::read.clipboard()

df <- data.frame(names = df$names,
                 dates = as.Date(df$dates))

library(lubridate)


tmp <- unlist(lapply(unique(df$names), function(x) {
                  tmp <- df[df$names == x, 2, drop = FALSE]
                  sapply(1:dim(tmp)[1], function(y) {
                    tmp[1, 1] + days(30) * (y - 1)
                    })
                } ))

df$new_dates <- as.Date(tmp, origin = '1970-01-01')

> df
   names      dates  new_dates
1   john 2010-06-01 2010-06-01
2   john 2010-06-01 2010-07-01
3   john 2010-06-01 2010-07-31
4   john 2010-06-01 2010-08-30
5   mary 2010-07-09 2010-07-09
6   mary 2010-07-09 2010-08-08
7   mary 2010-07-09 2010-09-07
8   mary 2010-07-09 2010-10-07
9    tom 2010-06-01 2010-06-01
10   tom 2010-06-01 2010-07-01
11   tom 2010-06-01 2010-07-31
12   tom 2010-06-01 2010-08-30
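
The as.Date(tmp, origin = '1970-01-01') step is needed because sapply() and unlist() drop the Date class and return plain day counts. A minimal base R sketch of that behaviour, using a single made-up start date instead of the data frame and plain numbers instead of lubridate's days():

```r
start <- as.Date("2010-06-01")

# sapply() simplifies the results to a plain numeric vector (days since 1970-01-01)
raw <- sapply(0:3, function(k) start + 30 * k)
print(class(raw))

# Supplying the origin turns the day counts back into dates
restored <- as.Date(raw, origin = "1970-01-01")
print(restored)
```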


Answer 3:

I'm a bit confused about exactly what you're looking for. I'm assuming that you're starting with a small data frame that looks like this:

> df <- data.frame(names=c("john","mary","tom"),dates=c(as.Date("2010-06-01"),as.Date("2010-07-09"),as.Date("2010-06-01")))
> df
  names      dates
1  john 2010-06-01
2  mary 2010-07-09
3   tom 2010-06-01

I'm also assuming that you then want to add N rows to your data frame, each with a value in a new dates column. If so, I'm sure there are some pre-packaged ways to do this, but you could also use two nested lapply() calls. The innermost call builds one row per person, with newdates set to the original date plus some multiple of 30, and the outermost call passes in each multiple of 30. For example:

> do.call(rbind,lapply(30*0:3,function(y) do.call(rbind,lapply(1:nrow(df),function(x) data.frame(names=df$names[x],dates=df$dates[x],newdates=df$dates[x]+y)))))
   names      dates   newdates
1   john 2010-06-01 2010-06-01
2   mary 2010-07-09 2010-07-09
3    tom 2010-06-01 2010-06-01
4   john 2010-06-01 2010-07-01
5   mary 2010-07-09 2010-08-08
6    tom 2010-06-01 2010-07-01
7   john 2010-06-01 2010-07-31
8   mary 2010-07-09 2010-09-07
9    tom 2010-06-01 2010-07-31
10  john 2010-06-01 2010-08-30
11  mary 2010-07-09 2010-10-07
12   tom 2010-06-01 2010-08-30

Anyway, this method isn't ideal and may be confusing so let me know if this is what you're looking for and I can provide more details about what is going on.
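
For comparison, the same expansion can be sketched without the nested lapply() calls by repeating row indices with rep(). This is a different technique from the one above; the variable n (rows per person) is an assumption for illustration, and the rows come out grouped by person rather than by offset.

```r
df <- data.frame(
  names = c("john", "mary", "tom"),
  dates = as.Date(c("2010-06-01", "2010-07-09", "2010-06-01"))
)
n <- 4  # hypothetical number of rows wanted per person

# Repeat each row n times, keeping each person's rows together
out <- df[rep(seq_len(nrow(df)), each = n), ]

# Offsets 0, 30, 60, 90 repeated once per person
out$newdates <- out$dates + 30 * rep(0:(n - 1), times = nrow(df))
rownames(out) <- NULL
print(out)
```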



Tags: r transform rows