I have a df with two variables, names and dates. I would like to create a new column (new_dates) which takes the first date belonging to each person (each person should have just one repeated date in this column) and add 30 days to each date as the rows descend.
Desired output is below. So row1 for each person holds the original date, row2 holds row1+30, row3 holds row2+30 and so on.
dff
names dates new_dates
1 john 2010-06-01 2010-06-01
2 john 2010-06-01 2010-07-01
3 john 2010-06-01 2010-07-31
4 john 2010-06-01 2010-08-30
5 mary 2010-07-09 2010-07-09
6 mary 2010-07-09 2010-08-08
7 mary 2010-07-09 2010-09-07
8 mary 2010-07-09 2010-10-07
9 tom 2010-06-01 2010-06-01
10 tom 2010-06-01 2010-07-01
11 tom 2010-06-01 2010-07-31
12 tom 2010-06-01 2010-08-30
I thought I could use transform for this. Here is my attempt at it - but it doesn't quite work for me.
dt <- transform(df, new_date = c(dates[2]+30, NA))
data.table makes this easy. Once you convert to a data table, it's basically one command. The main problem with your version is that you need to split the data by name first, so you can get the minimum date for each person, and then add the appropriate multiple of 30 days to that date for each successive row.
library(data.table)
df$dates <- as.Date(df$dates)
dt <- as.data.table(df)
dt[,
list(dates, new_dates=min(dates) + 0:(length(dates) - 1L) * 30),
by=names
]
# names dates new_dates
# 1: john 2010-06-01 2010-06-01
# 2: john 2010-06-01 2010-07-01
# 3: john 2010-06-01 2010-07-31
# 4: john 2010-06-01 2010-08-30
# 5: mary 2010-07-09 2010-07-09
# 6: mary 2010-07-09 2010-08-08
# 7: mary 2010-07-09 2010-09-07
# 8: mary 2010-07-09 2010-10-07
# 9: tom 2010-06-01 2010-06-01
# 10: tom 2010-06-01 2010-07-01
# 11: tom 2010-06-01 2010-07-31
# 12: tom 2010-06-01 2010-08-30
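If you would rather add the column to the existing table than build a new one, a minimal sketch along the same lines is below. It assumes the same column names as the question and rebuilds the data by hand from the desired output; `:=` and `.N` are standard data.table features, but treat this as an illustration rather than the only way to write it.

```r
library(data.table)

# Toy data rebuilt from the question's desired output (an assumption about the input)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), each = 4),
  dates = as.Date(rep(c("2010-06-01", "2010-07-09", "2010-06-01"), each = 4))
)
dt <- as.data.table(df)

# .N is the number of rows in the current group, so seq_len(.N) - 1L is 0, 1, 2, ...
# := adds new_dates to dt by reference instead of returning a new table
dt[, new_dates := min(dates) + 30L * (seq_len(.N) - 1L), by = names]
dt
```

Using seq_len(.N) instead of 0:(length(dates) - 1L) gives the same offsets here; it just avoids spelling out the group length.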
EDIT: here is a version that hopefully shows why yours didn't work. I still prefer data.table, but since this is very close to what you were doing, it should make clear what you need to change:
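re_date <- function(df) {
  # Sort one person's rows by date, then offset the earliest date by 0, 30, 60, ... days
  transform(
    df[order(df$dates), ],
    new_dates = min(dates) + 30 * 0:(length(dates) - 1L)
  )
}
# Split by person, add the column to each piece, then stack the pieces back together
do.call(rbind, lapply(split(df, df$names), re_date))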
Starting with the bottom line (do.call...), the split call makes a list with three data frames, one with the values for John, one with those for Mary, and one with those for Tom. The lapply then runs each of those data frames through the re_date function, which adds the new_dates column, and finally, the do.call/rbind stitches it back together into one data frame.
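The same split/apply/combine idea can also be sketched with base R's ave(), which applies a function within each group and returns the results in the original row order. This is only an alternative illustration: the toy data is rebuilt from the question's desired output, and day_num is a helper name made up for this example.

```r
# Toy data rebuilt from the question's desired output (an assumption about the input)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), each = 4),
  dates = as.Date(rep(c("2010-06-01", "2010-07-09", "2010-06-01"), each = 4))
)

# Work on the numeric day counts: within each name, take the earliest date
# and add 0, 30, 60, ... days, one step per row
day_num <- ave(
  as.numeric(df$dates), df$names,
  FUN = function(x) min(x) + 30 * (seq_along(x) - 1)
)

# Convert the day counts back to Date
df$new_dates <- as.Date(day_num, origin = "1970-01-01")
df
```

Unlike the split/rbind version, this leaves the row order and row names of df unchanged.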
Sorry, I read the question too quickly and didn't realize what you were doing at first.
This is definitely a brute-force method, and my programming is not, how you say, elegant, but it seems to give the desired result:
df <- psych::read.clipboard()
df <- data.frame(names = df$names,
dates = as.Date(df$dates))
library(lubridate)
tmp <- unlist(lapply(unique(df$names), function(x) {
tmp <- df[df$names == x, 2, drop = FALSE]
sapply(1:dim(tmp)[1], function(y) {
tmp[1, 1] + days(30) * (y - 1)
})
} ))
df$new_dates <- as.Date(tmp, origin = '1970-01-01')
> df
names dates new_dates
1 john 2010-06-01 2010-06-01
2 john 2010-06-01 2010-07-01
3 john 2010-06-01 2010-07-31
4 john 2010-06-01 2010-08-30
5 mary 2010-07-09 2010-07-09
6 mary 2010-07-09 2010-08-08
7 mary 2010-07-09 2010-09-07
8 mary 2010-07-09 2010-10-07
9 tom 2010-06-01 2010-06-01
10 tom 2010-06-01 2010-07-01
11 tom 2010-06-01 2010-07-31
12 tom 2010-06-01 2010-08-30
Exactly what you're looking for is a bit unclear to me. I'm assuming that you're starting with a small data frame that looks like this:
> df <- data.frame(names=c("john","mary","tom"),dates=c(as.Date("2010-06-01"),as.Date("2010-07-09"),as.Date("2010-06-01")))
> df
names dates
1 john 2010-06-01
2 mary 2010-07-09
3 tom 2010-06-01
And then want to add N rows to your data frame that have your new dates column. If so, I'm sure there are some pre-packaged ways to do this, but you could also use two nested lapply() calls. The innermost call simply adds a new column where newdates is set to some multiple of 30 plus your original date, and the outermost call passes in that multiple of 30. For example:
> do.call(rbind,lapply(30*0:3,function(y) do.call(rbind,lapply(1:nrow(df),function(x) data.frame(names=df$names[x],dates=df$dates[x],newdates=df$dates[x]+y)))))
names dates newdates
1 john 2010-06-01 2010-06-01
2 mary 2010-07-09 2010-07-09
3 tom 2010-06-01 2010-06-01
4 john 2010-06-01 2010-07-01
5 mary 2010-07-09 2010-08-08
6 tom 2010-06-01 2010-07-01
7 john 2010-06-01 2010-07-31
8 mary 2010-07-09 2010-09-07
9 tom 2010-06-01 2010-07-31
10 john 2010-06-01 2010-08-30
11 mary 2010-07-09 2010-10-07
12 tom 2010-06-01 2010-08-30
Anyway, this method isn't ideal and may be confusing so let me know if this is what you're looking for and I can provide more details about what is going on.
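For comparison, here is a rough sketch of the same expansion done as a cross join instead of nested lapply() calls. It assumes the three-row starting data frame shown above and four rows per person; offsets and offset are names invented for this example.

```r
# The assumed three-row starting point
df <- data.frame(
  names = c("john", "mary", "tom"),
  dates = as.Date(c("2010-06-01", "2010-07-09", "2010-06-01"))
)

# One row per multiple of 30 days to add
offsets <- data.frame(offset = 30 * 0:3)

# by = NULL makes merge() return every combination of a df row and an offset row
out <- merge(df, offsets, by = NULL)
out$newdates <- out$dates + out$offset

# Group each person's rows together and drop the helper column
out <- out[order(out$names, out$offset), c("names", "dates", "newdates")]
out
```

The result has the same twelve rows as the nested version, but sorted by person instead of by offset.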