Partitioning data set by time intervals in R

2019-04-16 17:47发布

问题:

I have some observed data by hour. I am trying to subset this data by the day or even week intervals. I am not sure how to proceed with this task in R.

The sample of the data is below.

date                                 obs
2011-10-24 01:00:00                  12
2011-10-24 02:00:00                  4
2011-10-24 19:00:00                  18
2011-10-24 20:00:00                  7
2011-10-24 21:00:00                  4
2011-10-24 22:00:00                  2
2011-10-25 00:00:00                  4
2011-10-25 01:00:00                  2
2011-10-25 02:00:00                  2
2011-10-25 15:00:00                  12
2011-10-25 18:00:00                  2
2011-10-25 19:00:00                  3
2011-10-25 21:00:00                  2
2011-10-25 23:00:00                  9
2011-10-26 00:00:00                  13
2011-10-26 01:00:00                  11

回答1:

First I entered the data with the multiple spaces replaced with tabs.

dat$date <- as.POSIXct(dat$date, format="%Y-%m-%d %H:%M:%S")
split(dat , as.POSIXlt(dat$date)$yday)
# Notice these are not the same functions
#---------------------
$`296`
                 date obs
1 2011-10-24 01:00:00  12
2 2011-10-24 02:00:00   4
3 2011-10-24 19:00:00  18
4 2011-10-24 20:00:00   7
5 2011-10-24 21:00:00   4
6 2011-10-24 22:00:00   2

$`297`
                  date obs
7  2011-10-25 00:00:00   4
8  2011-10-25 01:00:00   2
9  2011-10-25 02:00:00   2
10 2011-10-25 15:00:00  12
11 2011-10-25 18:00:00   2
12 2011-10-25 19:00:00   3
13 2011-10-25 21:00:00   2
14 2011-10-25 23:00:00   9

$`298`
                  date obs
15 2011-10-26 00:00:00  13
16 2011-10-26 01:00:00  11

The POSIXlt class does not work well inside dataframes but it can ve very handy for creating time based groups. It's a list structure with these indices: 'yday', 'wday', 'year', 'mon', 'mday', 'hour', 'min', 'sec' and 'isdt'. The cut.POSIXt function adds divisions at other natural boundaries; E.g.

?cut.POSIXt
  split(dat , cut(dat$date, "week") )

If you wanted to sum within date:

tapply(dat$obs, as.POSIXlt(dat$date)$yday, sum)
#-------
296 297 298 
 47  36  24 


回答2:

I'd use a time series class such as xts

dat <- read.table(text="2011-10-24 01:00:00                  12
2011-10-24 02:00:00                  4
2011-10-24 19:00:00                  18
2011-10-24 20:00:00                  7
2011-10-24 21:00:00                  4
2011-10-24 22:00:00                  2
2011-10-25 00:00:00                  4
2011-10-25 01:00:00                  2
2011-10-25 02:00:00                  2
2011-10-25 15:00:00                  12
2011-10-25 18:00:00                  2
2011-10-25 19:00:00                  3
2011-10-25 21:00:00                  2
2011-10-25 23:00:00                  9
2011-10-26 00:00:00                  13
2011-10-26 01:00:00                  11", header=FALSE, stringsAsFactors=FALSE)

xobj <- xts(dat[, 3], as.POSIXct(paste(dat[, 1], dat[, 2])))

xts subsetting is very intuitive. For all data on "2011-10-25", do this

xobj["2011-10-25"]
#                    [,1]
#2011-10-25 00:00:00    4
#2011-10-25 01:00:00    2
#2011-10-25 02:00:00    2
#2011-10-25 15:00:00   12
#2011-10-25 18:00:00    2
#2011-10-25 19:00:00    3
#2011-10-25 21:00:00    2
#2011-10-25 23:00:00    9

You can also subset out time spans like this (all data between and including 2011-10-24 and 2011-10-25)

xobj["2011-10-24/2011-10-25"]

Or, if you want all data from October 2011,

xobj["2011-10"]

If you want to get all data from any day that is between 19:00 and 20:00,

xobj['T19:00:00/T20:00:00']
#                    [,1]
#2011-10-24 19:00:00   18
#2011-10-24 20:00:00    7
#2011-10-25 19:00:00    3

You can use the endpoints function to find the rows that are the last rows of a time period ("hours", "days", "weeks", etc.)

endpoints(xobj, "days")
[1]  0  6 14 16    

Or you can convert to a lower frequency

to.weekly(xobj)
#           xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-26        12        18        2         11
to.daily(xobj)
#           xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-25        12        18        2          2
#2011-10-26         4        12        2          9
#2011-10-26        13        13       11         11

Notice that the above creates columns for Open, High, Low, and Close. If you only want the data at the endpoints, you can use OHLC=FALSE

to.daily(xobj, OHLC=FALSE)
#           [,1]
#2011-10-25    2
#2011-10-26    9
#2011-10-26   11

For more basic subsetting, and much more, visit http://www.quantmod.com/examples/

As @JoshuaUlrich mentions in the comments, split.xts is INCREDIBLY useful.

You can split by day (or week, or month, etc), apply a function, then recombine

split(xobj, 'days') #create a list where each element is the data for a different day
#[[1]]
#                    [,1]
#2011-10-24 01:00:00   12
#2011-10-24 02:00:00    4
#2011-10-24 19:00:00   18
#2011-10-24 20:00:00    7
#2011-10-24 21:00:00    4
#2011-10-24 22:00:00    2
#
#[[2]]
#                    [,1]
#2011-10-25 00:00:00    4
#2011-10-25 01:00:00    2
#2011-10-25 02:00:00    2
#2011-10-25 15:00:00   12
#2011-10-25 18:00:00    2
#2011-10-25 19:00:00    3
#2011-10-25 21:00:00    2
#2011-10-25 23:00:00    9
#
#[[3]]
#                    [,1]
#2011-10-26 00:00:00   13
#2011-10-26 01:00:00   11

Suppose you want only the first value of each day. split by day, lapply the first function and rbind back together.

do.call(rbind, lapply(split(xobj, 'days'), first))
#                    [,1]
#2011-10-24 01:00:00   12
#2011-10-25 00:00:00    4
#2011-10-26 00:00:00   13