I am working with BigQuery scripting, I have written a simple WHILE loop which iterates through daily Google Analytics tables and sums the visits, now I'd like to write these results out to a table.
I've gotten as far as creating the table, but I can't capture the value of visits
from my SQL query to populate the table. Date
works fine, because it is defined outside of the SQL. I tried to DECLARE
the value of visits
with a new variable, but again this does not work because it's not known outside of the statement.
SET vis = visits;
How can I correctly write my results out to a table?
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
DECLARE pfix STRING DEFAULT REGEXP_REPLACE(CAST(d AS STRING),"-","");
DECLARE vis INT64;
CREATE OR REPLACE TABLE test.looped_results (Date DATE, Visits INT64);
WHILE d > '2019-10-01' DO
SELECT d, SUM(totals.visits) AS visits
FROM `project.dataset.ga_sessions_*`
WHERE _table_suffix = pfix
GROUP BY Date;
SET d = DATE_SUB(d, INTERVAL 1 DAY);
SET vis = visits;
INSERT INTO test.looped_results VALUES (d, visits);
END WHILE;
Update: I also tried an alternative solution, assigning visits to it's own variable, but this produces the same error:
WHILE d > '2019-10-01' DO
SET vis_count = (SELECT SUM(totals.visits) AS visits
FROM `mindful-agency-136314.43786551.ga_sessions_*`
WHERE _table_suffix = pfix);
INSERT INTO test.looped_results VALUES (d, vis_count);
SET d = DATE_SUB(d, INTERVAL 1 DAY);
END WHILE;
Results:
In my results I see the correct number of rows created, with the correct dates, but the value of visits
for each is the value for the most recent day.
Actually, you need to update the pfix
variable in there. Also, it is a good idea to instantiate the visits
. Finally, your GROUPBY
doesn't necessarily need a dimension if you are providing it with a pfix
constraint.
This should do it:
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
DECLARE pfix STRING DEFAULT REGEXP_REPLACE(CAST(d AS STRING),'-','');
DECLARE visits int64;
SET visits = 0;
CREATE OR REPLACE TABLE project.dataset.looped_results (Date DATE, Visits INT64);
WHILE d > '2019-10-01' DO
SET visits = (SELECT SUM(totals.visits) FROM `project.dataset.ga_sessions_*` WHERE _table_suffix = pfix);
SET d = DATE_SUB(d, INTERVAL 1 DAY);
SET pfix = REGEXP_REPLACE(CAST(d AS STRING),"-","");
INSERT INTO dataset.looped_results VALUES (d, visits);
END WHILE;
Hope it helps.
I would also move INSERT INTO
outside of the WHILE
loop by collecting result into result
variable (along with few other minor changes) as in below example
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
DECLARE pfix STRING;
DECLARE vis_count INT64;
DECLARE result ARRAY<STRUCT<vis_date DATE, vis_count INT64>> DEFAULT [];
CREATE OR REPLACE TABLE test.looped_results (Date DATE, Visits INT64);
WHILE d > '2019-10-01' DO
SET pfix = REGEXP_REPLACE(CAST(d AS STRING),"-","");
SET vis_count = (SELECT SUM(totals.visits) AS visits
FROM `project.dataset.ga_sessions_*`
WHERE _table_suffix = pfix);
SET result = ARRAY_CONCAT(result, [STRUCT(d, vis_count)]);
SET d = DATE_SUB(d, INTERVAL 1 DAY);
END WHILE;
INSERT INTO test.looped_results SELECT * FROM UNNEST(result);
Note: I hope your example is for scripting learning purpose and not for production as whenever possible we should stick with set based processing which can be easily done in your case
Here is a better way which is faster and without using a loop.
Basically, you form an array of suffix and do SELECT/INSERT in single query:
DECLARE date_range ARRAY<DATE> DEFAULT
GENERATE_DATE_ARRAY(DATE '2019-10-01', DATE '2019-10-10', INTERVAL 1 DAY);
DECLARE suffix_array ARRAY<STRING>
DEFAULT (SELECT ARRAY_AGG(REGEXP_REPLACE(CAST(dates AS STRING),"-",""))
FROM UNNEST(date_range) dates);
CREATE OR REPLACE TABLE test.looped_results (Date DATE, Visits INT64);
INSERT INTO test.looped_results
SELECT Date, SUM(totals.visits)
FROM `project.dataset.ga_sessions_*`
WHERE _table_suffix in UNNEST(suffix_array);
GROUP BY Date;
Having reviewed my code (several times!) I realized that I wasn't refreshing the variable which transforms the data into the table prefix within the loop.
Here is a working version of the script, where I set pfix
at the end of the loop:
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
DECLARE pfix STRING DEFAULT REGEXP_REPLACE(CAST(d AS STRING),"-","");
DECLARE vis_count INT64;
CREATE OR REPLACE TABLE test.looped_results (Date DATE, Visits INT64);
WHILE d > '2019-10-01' DO
SET vis_count = (SELECT SUM(totals.visits) AS visits
FROM `project.dataset.ga_sessions_*`
WHERE _table_suffix = pfix);
INSERT INTO test.looped_results VALUES (d, vis_count);
SET d = DATE_SUB(d, INTERVAL 1 DAY);
SET pfix = REGEXP_REPLACE(CAST(d AS STRING),"-","");
END WHILE;