I'm trying to fill daily data for missing dates and can not find an answer, please help.
My daily_table
example:
url | timestamp_gmt | visitors | hits | other..
-------------------+---------------+----------+-------+-------
www.domain.com/1 | 2016-04-12 | 1231 | 23423 |
www.domain.com/1 | 2016-04-13 | 1374 | 26482 |
www.domain.com/1 | 2016-04-17 | 1262 | 21493 |
www.domain.com/2 | 2016-05-09 | 2345 | 35471 |
Expected result: I wand to fill this table with data for every domain and every day which just copy data from previous date
:
url | timestamp_gmt | visitors | hits | other..
-------------------+---------------+----------+-------+-------
www.domain.com/1 | 2016-04-12 | 1231 | 23423 |
www.domain.com/1 | 2016-04-13 | 1374 | 26482 |
www.domain.com/1 | 2016-04-14 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-15 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-16 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-17 | 1262 | 21493 |
www.domain.com/2 | 2016-05-09 | 2345 | 35471 |
I can move a part of the logic into php, but it is undesirable, because my table has billions of missing dates.
SUMMARY:
During a few last days I foud out that:
- Amazon-redshift works with 8-th version of PostgreSql, that's why it does not support such a beautiful command like
JOIN LATERAL
- Redshift also does not support
generate_series
andCTEs
- But it supports simple
WITH
(thank you @systemjack) butWITH RECURSIVE
does not
Look at the idea behind the query:
You'll have to choose start and end dates according to your requirements. The query may be quite expensive (you mentioned about billions gaps) so apply it with caution (test on a smaller data subset or execute by stages).
In the absence of
generate_series()
you can create your own generator. Here is an interesting example. Views from the cited article can be used instead ofgenerate_series()
. For example, if you need the period'2016-04-12' + 5 days
:you'll get the same result like in the first example.
An alternative solution, avoiding all "modern" features ;-]
Finally, I finished my task and I want to share some useful things.
Instead of
generate_series
I used this hook:To get list of URLs which I have to fill with the data I used this:
Then I combinet given data, let's call it
url_mapping
:And to get data by closest date I did the following:
I hope it will help someone.
Here's an ugly hack to get redshift to generate new rows into a table using a date in this case. This example limits the output to the previous 30 days. The ranges can be tweaked or removed. This same approach can be used for minutes, seconds, etc. as well.
To target a specific time range change the
sysdate
to a literal which would be the last day after the end of the range you want and the limit to how many days to cover.The insert would be something like so:
I wasn't able to test the insert so that might need some tweaking.
The reference to the
stv_blocklist
table could be any table with enough rows in it to cover the range limit in the with clause and is used to provide a seed for therow_number()
window function.Once you have the date only rows in place you can update them with the most recent full record like so:
This is pretty slow but for a smaller data set or a one-off it should be fine. I was able to test a similar query.
UPDATE: I think this version of the query to fill in the nulls should work better and account for domain and date. I tested a similar version.
It should be possible to combine this with the initial population query and do it all at once.