Create 10,000 date data.frames with fake years bas

Posted 2019-09-19 07:39

Question:

Here is my time period range:

start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')

df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'

I need to create 10,000 data.frames, each with different fake years of 365 days. This means that each of the 10,000 data.frames needs to have a different start and end of year.

In total, df has 14,965 days, which divided by 365 gives 41 years. In other words, df needs to be grouped 10,000 different ways into 41 years (of 365 days each). The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc., and the dates left over at the end of df need to wrap around to the beginning.

The grouped fake years need to appear in a 3rd column of each data.frame.

I would put all the data.frames into a list, but I don't know how to write a function that generates 10,000 different year start dates and then groups each data.frame into 41 windows of 365 days.

Can anyone help me?


@gringer gave a good answer but it solved only 90% of the problem:

dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
                                            length.out=365, by="day"),
                                 simplify=FALSE))
colnames(dates.df) <- 1:10000

What I need is 10,000 columns of 14,965 rows each, filled with dates taken from df and wrapping around to the start when the end of df is reached.

I tried changing length.out to 14965, but R does not recycle the dates.


Another option could be to set length.out = 1 and then append the remaining df rows to each column, keeping the same order:

dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
                                            length.out=1, by="day"),
                                 simplify=FALSE))
colnames(dates.df) <- 1:10000

How can I add the remaining df rows to each col?

Answer 1:

The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:

> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"

When used in combination with replicate and sample, I think this will give what you want in a list:

> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
 [1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
 [6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"

[[2]]
 [1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
 [6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"

Without the simplify=FALSE argument, replicate produces a numeric matrix (i.e. R's internal representation of dates, as days since 1970-01-01), which is a bit trickier to convert back to dates. A slightly more convoluted way to get Date output is to call data.frame on the unsimplified replicate result. Here's an example that produces a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):

dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
                                            length.out=365, by="day"),
                                 simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
           1          2          3          4          5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
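To make the point about simplify concrete, here is a minimal sketch, assuming the same date range as in the question. It shows that the default simplification drops the Date class, and one way to restore it for a single column. The names m and back are only illustrative, and the origin passed to as.Date is R's epoch for dates.

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
df <- data.frame(date = seq(from = start_day, to = end_day, by = "day"))

set.seed(1)  # only so that the sketch is reproducible

# Default simplify = TRUE: the Date class is lost and a plain numeric matrix comes back
m <- replicate(3, seq(sample(df$date, 1), length.out = 5, by = "day"))
is.matrix(m)
is.numeric(m)

# Restore the Date class for one column by supplying the epoch explicitly
back <- as.Date(m[, 1], origin = "1970-01-01")
class(back)
```

Converting every column this way is possible but clumsy, which is why keeping simplify=FALSE and wrapping the list in data.frame is the more direct route.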

To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:

df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
                      seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";

This is easier to code for downstream; the alternative is a double seq for each result column, with additional calculations for the start/end and if statements to deal with boundary cases.

Now, instead of doing date arithmetic, each result column is a subset of the original data frame (where the arithmetic is already done): start at a random position in the first half of the frame and take that value plus the next 14,964, for 14,965 in total. I'm using nrow(df)/2 rather than a hard-coded 14965 to keep the code more generic:

dates.df <-
    as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
                         function(startPos){
                             df$date[startPos:(startPos+nrow(df)/2-1)];
                         }));
colnames(dates.df) <- 1:10000;

>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
               1          2          3          4          5
1     1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2     1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3     1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4     1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5     1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27

This takes a bit less time now, presumably because the date values have been precalculated.
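If doubling the data frame is undesirable, the wraparound can also be done with modular index arithmetic on the original, undoubled vector. This is a sketch of that alternative rather than the approach used above; wrap_dates is a hypothetical helper name, and only 5 columns are generated here to keep it quick.

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq(from = start_day, to = end_day, by = "day")
n <- length(date_vec)

# Hypothetical helper: return all dates starting at position start_pos,
# wrapping back to position 1 after the last element
wrap_dates <- function(start_pos, dates = date_vec) {
  len <- length(dates)
  # zero-based offsets 0..len-1, shifted by the start, wrapped, back to one-based
  idx <- (start_pos - 1 + seq_len(len) - 1) %% len + 1
  dates[idx]
}

set.seed(42)                  # only so that the sketch is reproducible
starts <- sample.int(n, 5)    # use 10000 for the full problem
cols <- lapply(starts, wrap_dates)
dates.small <- data.frame(cols)
colnames(dates.small) <- seq_along(starts)
```

Each column then contains every date exactly once, starting at a random position and continuing from 1974-01-01 after 2014-12-21.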



Answer 2:

Try this one, using subsetting instead:

start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')

date_vec <- seq.Date(from=start_day, to=end_day, by="day")

Now, I create a vector long enough so that I can use easy subsetting later on:

date_vec2 <- rep(date_vec,2)

Now, create the random start dates for 100 instances (replace this with 10000 for your application):

random_starts <- sample(1:14965, 100)

Now, create a list of dates by simply subsetting date_vec2 with your desired length:

dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100

date_df[1:5,1:5]

           1          2          3          4          5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
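The question also asks for a list of data.frames with a column marking the fake years. A rough sketch of that step, building on the date_vec2 subsetting above: since every instance has 14,965 rows in the same order, the same label vector (365 ones, 365 twos, and so on up to 41) can be attached to each. The names fake_year and df_list are assumptions, and the label is placed as the second column here because the example df has no other data column.

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq.Date(from = start_day, to = end_day, by = "day")
date_vec2 <- rep(date_vec, 2)            # doubled copy for easy wraparound
n <- length(date_vec)                    # 14965 = 41 * 365

# Fake-year label per row: each of 1..41 repeated 365 times
fake_year <- rep(seq_len(n / 365), each = 365)

set.seed(123)                            # only so that the sketch is reproducible
random_starts <- sample(seq_len(n), 3)   # 3 instances here; 10000 in the question

# One data.frame per random start, each with its dates and fake-year labels
df_list <- lapply(random_starts, function(x) {
  data.frame(date = date_vec2[x:(x + n - 1)], fake_year = fake_year)
})

head(df_list[[1]])
```

With 10,000 instances the date columns alone hold about 150 million values (roughly 1.2 GB as doubles), so it may be worth storing only the start positions and building each data.frame on demand.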