Performing aggregation through date and time in SQL

Posted 2019-04-21 15:12

Question:

I have a data set that contains observations over several weeks at a 2-minute frequency. I want to increase the time interval from 2 minutes to 5 minutes. The problem is that the frequency of the observations is not always the same: in theory there should be 5 observations every 10 minutes, but usually that is not the case. How can I aggregate the observations with an average function, based on the date and time of each observation? In other words, I want to aggregate over every 5 minutes even though the number of observations differs from one 5-minute interval to the next. The date and time are stored in timestamp format.

Example Data:

1 2007-09-14 22:56:12 5.39
2 2007-09-14 22:58:12 5.34
3 2007-09-14 23:00:12 5.16
4 2007-09-14 23:02:12 5.54
5 2007-09-14 23:04:12 5.30
6 2007-09-14 23:06:12 5.20

Expected results:

1 2007-09-14 23:00 5.29
2 2007-09-14 23:05 5.34

Answer 1:

The answers to this question likely provide good solutions to your problem, showing ways to efficiently aggregate data into time windows.

Essentially, use the avg aggregate with:

GROUP BY floor(extract(epoch from the_timestamp) / 60 / 5)
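
To see the bucketing arithmetic in action, here is a minimal sketch that runs the same idea against the example data from the question. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so strftime('%s', ...) and integer division take the place of extract(epoch ...) and floor. The table name observation and the columns ts and value are assumptions made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation (ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO observation VALUES (?, ?)",
    [
        ("2007-09-14 22:56:12", 5.39),
        ("2007-09-14 22:58:12", 5.34),
        ("2007-09-14 23:00:12", 5.16),
        ("2007-09-14 23:02:12", 5.54),
        ("2007-09-14 23:04:12", 5.30),
        ("2007-09-14 23:06:12", 5.20),
    ],
)

# Seconds since the epoch, integer-divided by 300 (5 minutes), give one
# bucket number per 5-minute window; multiplying back by 300 turns the
# bucket number into the timestamp at which the window starts.
rows = conn.execute(
    """
    SELECT datetime((CAST(strftime('%s', ts) AS INTEGER) / 300) * 300,
                    'unixepoch') AS bucket_start,
           AVG(value) AS avg_value
    FROM observation
    GROUP BY bucket_start
    ORDER BY bucket_start
    """
).fetchall()

for bucket_start, avg_value in rows:
    print(bucket_start, round(avg_value, 3))
```

Each row is labeled with the start of its window. With this data the windows start at 22:55, 23:00 and 23:05, and a window with fewer observations is simply averaged over fewer values.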


Answer 2:

EDIT: I thought about this some more and realized that you can't go straight from 2-minute to 5-minute data: 5 is not a multiple of 2, so the intervals don't line up. I'll follow up on that, but the following code does work once you have some 1-minute data to aggregate!

--

If your timestamps mark the beginning of each interval (a 'beginning' format), you can use the code inside this function directly, or create the function in your database for ease of access:

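CREATE OR REPLACE FUNCTION dev.beginning_datetime_floor(timestamp without time zone, integer)  /* switch out 'dev' with your schema name */
RETURNS timestamp without time zone AS
$BODY$
/* $1 = timestamp to floor, $2 = bucket size in minutes */
/* change 'CST6CDT' to your time zone; on PostgreSQL 9.2 and later, extract(epoch ...) of a
   timestamp without time zone no longer depends on the session time zone, so 'UTC' is likely
   what you want here */
SELECT
date_trunc('minute',timestamp with time zone 'epoch' + 
 floor(extract(epoch from $1)/($2*60))*$2*60
 * interval '1 second') at time zone 'CST6CDT'
$BODY$
LANGUAGE sql VOLATILE;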

You just feed it the integer number of minutes you want to aggregate on (use 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30, i.e. a divisor of 60, so the buckets line up with the hour). Here are a couple of results:

select dev.beginning_datetime_floor('2012-01-01 02:02:21',2)

= '2012-01-01 02:02:00'

select dev.beginning_datetime_floor('2012-01-01 02:02:21',5)

= '2012-01-01 02:00:00'

Just test it out and add or subtract time to handle beginning versus ending timestamps using the built-in timestamp functions.

When you get the timestamp you want, do what Craig said, and GROUP BY on that timestamp, in conjunction with your desired aggregate functions (likely averages).

You could test/tweak it with:

date_trunc('minute',timestamp with time zone 'epoch' + 
 floor(extract(epoch from your_datetime)/(interval_minutes*60))*interval_minutes*60
 * interval '1 second') at time zone 'CST6CDT' /* change this to your time zone */

It may turn out that you want to average the timestamps - if your interval duration is volatile for example. For this, you could make a similar function that rounds the timestamp instead of taking a floor.
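
The difference between flooring and rounding a timestamp is easy to check outside the database. The following is a rough Python sketch of both, not a translation of the function above; the names floor_to_minutes and round_to_minutes are made up for illustration, and time zones are ignored entirely (naive timestamps only).

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def floor_to_minutes(ts, minutes):
    """Return the start of the `minutes`-wide bucket that contains ts."""
    step = minutes * 60
    seconds = int((ts - EPOCH).total_seconds())  # fractional seconds dropped
    return EPOCH + timedelta(seconds=(seconds // step) * step)


def round_to_minutes(ts, minutes):
    """Return the nearest `minutes` boundary; exact halves round up."""
    step = minutes * 60
    seconds = int((ts - EPOCH).total_seconds())  # fractional seconds dropped
    return EPOCH + timedelta(seconds=((seconds + step // 2) // step) * step)


sample = datetime(2012, 1, 1, 2, 2, 21)
print(floor_to_minutes(sample, 2))   # same inputs as the two SQL examples above
print(floor_to_minutes(sample, 5))
print(round_to_minutes(sample, 5))
print(round_to_minutes(datetime(2012, 1, 1, 2, 2, 31), 5))
```

Flooring always labels an observation with the start of its window, while rounding moves observations in the second half of a window to the next boundary, so the two choices can put the same observation in different groups.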



Answer 3:

By far the simplest option is to create a reference table. In that table you store the intervals you are interested in:

(Adapt this to your own RDBMS's date notation.)

CREATE TABLE interval (
  start_time    DATETIME,
  cease_time    DATETIME
);
INSERT INTO interval SELECT '2012-10-22 12:00', '2012-10-22 12:05';
INSERT INTO interval SELECT '2012-10-22 12:05', '2012-10-22 12:10';
INSERT INTO interval SELECT '2012-10-22 12:10', '2012-10-22 12:15';
INSERT INTO interval SELECT '2012-10-22 12:15', '2012-10-22 12:20';
INSERT INTO interval SELECT '2012-10-22 12:20', '2012-10-22 12:25';
INSERT INTO interval SELECT '2012-10-22 12:25', '2012-10-22 12:30';
INSERT INTO interval SELECT '2012-10-22 12:30', '2012-10-22 12:35';
INSERT INTO interval SELECT '2012-10-22 12:35', '2012-10-22 12:40';

Then you just join and aggregate...

SELECT
  interval.start_time,
  AVG(observation.value)
FROM
  interval
LEFT JOIN
  observation
    ON  observation.timestamp >= interval.start_time
    AND observation.timestamp <  interval.cease_time
GROUP BY
  interval.start_time

NOTE: You only need to create and populate that intervals table once, then you can re-use it many times.
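
As a quick way to try the reference-table idea without typing every interval by hand, here is a small sketch that generates the rows in a loop and runs the same join. It uses Python's built-in sqlite3 module rather than your own RDBMS; the table is called intervals and the timestamp column ts to stay clear of keywords, and the three observations are invented purely for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE intervals (start_time TEXT, cease_time TEXT)")
conn.execute("CREATE TABLE observation (ts TEXT, value REAL)")

# Generate consecutive 5-minute intervals instead of writing each INSERT.
fmt = "%Y-%m-%d %H:%M:%S"
start = datetime(2012, 10, 22, 12, 0)
step = timedelta(minutes=5)
conn.executemany(
    "INSERT INTO intervals VALUES (?, ?)",
    [
        ((start + i * step).strftime(fmt), (start + (i + 1) * step).strftime(fmt))
        for i in range(3)
    ],
)

# Invented observations; note that nothing falls into 12:05 to 12:10.
conn.executemany(
    "INSERT INTO observation VALUES (?, ?)",
    [
        ("2012-10-22 12:01:12", 1.0),
        ("2012-10-22 12:03:12", 3.0),
        ("2012-10-22 12:11:12", 5.0),
    ],
)

# ISO-formatted text timestamps compare correctly as strings in SQLite.
result = conn.execute(
    """
    SELECT i.start_time, AVG(o.value)
    FROM intervals AS i
    LEFT JOIN observation AS o
      ON  o.ts >= i.start_time
      AND o.ts <  i.cease_time
    GROUP BY i.start_time
    ORDER BY i.start_time
    """
).fetchall()

for start_time, avg_value in result:
    print(start_time, avg_value)
```

Because of the LEFT JOIN, an interval with no observations still shows up in the result, with a NULL (None in Python) average. With the epoch-based GROUP BY approach such empty windows would just be missing.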



Answer 4:

Ok, so this is just one way to handle this. I hope this gets you thinking about how to convert the data for your analysis needs.

There's a prerequisite to test out this code: you need a table with all possible 1-minute timestamps. There are many ways to go about this. I'll just use what I have available, which is one table, dim_time, which has each minute (00:01:00) through (23:59:00), and another table with all possible dates (dim_date). When you join these (ON 1=1) you get all possible minutes for all possible days.

--first you need to create some functions I'll use later
--credit to this first function goes to David Walling
CREATE OR REPLACE FUNCTION dev.beginning_datetime_floor(timestamp without time zone, integer)
  RETURNS timestamp without time zone AS
$BODY$ 
SELECT
date_trunc('minute',timestamp with time zone 'epoch' + 
    floor(extract(epoch from $1)/($2*60))*$2*60
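* interval '1 second') at time zone 'CST6CDT' /* change this to your time zone; on PostgreSQL 9.2+, where extract(epoch ...) of a timestamp without time zone ignores the session time zone, likely 'UTC' */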
$BODY$
  LANGUAGE sql VOLATILE;

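--the following function is what I described in my previous post
--note: it rounds using only the minute field, so with a 1-minute interval the seconds are dropped rather than rounded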
CREATE OR REPLACE FUNCTION dev.round_minutes(timestamp without time zone, integer)
  RETURNS timestamp without time zone AS
$BODY$ 
  SELECT date_trunc('hour', $1) + cast(($2::varchar||' min') as interval) * round(date_part('minute',$1)::float / cast($2 as float)) 
$BODY$
  LANGUAGE sql VOLATILE;

--let's load the data into a temp table, I added some data points. note: i got rid of the partial seconds
SELECT cast(timestamp_original as timestamp) as timestamp_original, datapoint INTO TEMPORARY TABLE timestamps_second2
FROM
(
SELECT '2007-09-14 22:56:12' as timestamp_original, 0 as datapoint
UNION
SELECT '2007-09-14 22:58:12' as timestamp_original, 1 as datapoint
UNION
SELECT '2007-09-14 23:00:12' as timestamp_original, 10 as datapoint 
UNION
SELECT '2007-09-14 23:02:12' as timestamp_original, 100 as datapoint
UNION
SELECT '2007-09-14 23:04:12' as timestamp_original, 1000 as datapoint
UNION
SELECT '2007-09-14 23:06:12' as timestamp_original, 10000 as datapoint
) as data;

--this is the bit of code you'll have to replace with your implementation of getting all possible minutes
--you could make some sequence of timestamps in R, or simply make the timestamps in Excel to test out the rest of the code
--the result of the query is simply '2007-09-14 00:00:00' through '2007-09-14 23:59:00'
SELECT * INTO TEMPORARY TABLE possible_timestamps
FROM
(
select the_date + beginning_minute as minute_timestamp
FROM datawarehouse.dim_date as dim_date
JOIN datawarehouse.dim_time as dim_time
ON 1=1
where dim_date.the_date = '2007-09-14'
group by the_date, beginning_minute
order by the_date, beginning_minute
) as data;

--round to nearest minute (be sure to think about how this might change your results)
SELECT * INTO TEMPORARY TABLE rounded_timestamps2
FROM
(
SELECT dev.round_minutes(timestamp_original,1) as minute_timestamp_rounded, datapoint
from timestamps_second2
) as data;

--let's join what minutes we have data for versus the possible minutes
--I used some subqueries so when you select all from the table you'll see the important part (not needed)
SELECT * INTO TEMPORARY TABLE joined_with_possibles
FROM
(
SELECT *
FROM
(
SELECT *, (MIN(minute_timestamp_rounded) OVER ()) as min_time, (MAX(minute_timestamp_rounded) OVER ()) as max_time
FROM possible_timestamps as t1
LEFT JOIN rounded_timestamps2 as t2
ON t1.minute_timestamp = t2.minute_timestamp_rounded
ORDER BY t1.minute_timestamp asc
) as inner_query
WHERE minute_timestamp >= min_time
AND minute_timestamp <= max_time
) as data;

--here's the tricky part that might not suit your needs, but it's one method
--if it's missing a value it grabs the previous value
--if it's missing the prior value it grabs the one before that, otherwise it's null
--best practice would be run another case statement with 0,1,2 specifying which point was pulled, then you can count those when you aggregate
SELECT * INTO TEMPORARY TABLE shifted_values
FROM
(
SELECT 
*,
case 
when datapoint is not null then datapoint
when datapoint is null and (lag(datapoint,1) over (order by minute_timestamp asc)) is not null
  then lag(datapoint,1) over (order by minute_timestamp asc)
when datapoint is null and (lag(datapoint,1) over (order by minute_timestamp asc)) is null and (lag(datapoint,2) over (order by minute_timestamp asc)) is not null
  then lag(datapoint,2) over (order by minute_timestamp asc)
else null end as last_good_value
from joined_with_possibles
ORDER BY minute_timestamp asc
) as data;

--now we use the function from my previous post to make the timestamps to aggregate on
SELECT * INTO TEMPORARY TABLE shifted_values_with_five_minute
FROM
(
SELECT *, dev.beginning_datetime_floor(minute_timestamp,5) as five_minute_timestamp
FROM shifted_values
) as data;

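--finally we aggregate
--average the filled-in column (last_good_value); AVG(datapoint) would skip the minutes that were filled in above
SELECT
AVG(last_good_value) as avg_datapoint, five_minute_timestamp
FROM shifted_values_with_five_minute
GROUP BY five_minute_timestamp;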
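
If you want to sanity-check the logic of these steps before wiring up the dimension tables, the whole pipeline can be sketched in a few lines of Python. This is only an approximation under stated assumptions: it drops the seconds instead of calling dev.round_minutes, it builds the minute grid in a loop instead of joining dim_date and dim_time, and it averages the filled-in values. All variable names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

raw = [
    ("2007-09-14 22:56:12", 0),
    ("2007-09-14 22:58:12", 1),
    ("2007-09-14 23:00:12", 10),
    ("2007-09-14 23:02:12", 100),
    ("2007-09-14 23:04:12", 1000),
    ("2007-09-14 23:06:12", 10000),
]
fmt = "%Y-%m-%d %H:%M:%S"
one_minute = timedelta(minutes=1)

# Step 1: snap every observation to its minute (seconds are dropped).
by_minute = {datetime.strptime(ts, fmt).replace(second=0): v for ts, v in raw}

# Step 2: every possible minute between the first and last observation.
grid = []
t = min(by_minute)
while t <= max(by_minute):
    grid.append(t)
    t += one_minute

# Step 3: keep the value if present, else look back 1 minute, then 2 minutes.
filled = {}
for t in grid:
    filled[t] = None
    for back in (0, 1, 2):
        value = by_minute.get(t - back * one_minute)
        if value is not None:
            filled[t] = value
            break

# Step 4: floor each minute to its 5-minute bucket and average the values.
buckets = defaultdict(list)
for t, value in filled.items():
    if value is not None:
        buckets[t.replace(minute=t.minute - t.minute % 5)].append(value)
averages = {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

for bucket, avg in averages.items():
    print(bucket, avg)
```

Filling the gaps first means every 5-minute bucket is averaged over its 1-minute slots rather than over however many raw observations happened to land in it, which is the point of going through 1-minute data on the way from 2 minutes to 5 minutes.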