I'm working on counting all visitors that submitted postcode on our homepage. I came up with following query in legacy SQL:
SELECT fullVisitorId, visitStartTime
FROM TABLE_DATE_RANGE([ga_sessions_], TIMESTAMP('2017-01-29'), CURRENT_TIMESTAMP())
where hits.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and hits.type = 'EVENT'
and hits.eventInfo.eventCategory = 'Homepage'
and hits.eventInfo.eventAction = 'Submit Postcode';
I then wanted to convert it to standard SQL to use within CTE and came up with this one that doesn't seem right though.
SELECT fullVisitorId, visitStartTime
FROM ``ga_sessions_*``, UNNEST(hits) as h
where
_TABLE_SUFFIX > '2017-01-29'
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
The first one processes 327 MB and returns 4117 results, the second one processes 6.98 GB and returns 60745 results.
I've looked at the migration guide, but it didn't prove very helpful for me.
ga_sessions has standard schema of GA import into Bigquery.
It looks like difference is coming from the fact that with Standard SQL you are flattening the table on hits
when you CROSS JOIN UNNEST(hits)
in the FROM
clause, and therefore adding more rows to the result. More equivalent query would be:
#standardSQL
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions_*`
where
_TABLE_SUFFIX > '20170129'
and visitStartTime > 1483228800
and EXISTS(
SELECT 1 FROM UNNEST(hits) h
WHERE h.type = 'EVENT'
and h.page.pagePath = '/broadband/'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode');
What happened here is that as _TABLE_SUFFIX
is a string so when you do:
_TABLE_SUFFIX > '2017-01-29'
You will end up selecting way more tables then expected as string comparisons is different from number comparisons.
One possible way to fix that is by parsing the string to DATE
type:
SELECT fullVisitorId, visitStartTime
FROM `ga_sessions*`, UNNEST(hits) as h
where parse_date("%Y%m%d", regexp_extract(_table_suffix, r'.*_(.*)')) >= parse_date("%Y-%m-%d", '2017-01-29')
AND h.page.pagePath = '/broadband/'
and visitStartTime > 1483228800
and h.type = 'EVENT'
and h.eventInfo.eventCategory = 'Homepage'
and h.eventInfo.eventAction = 'Submit Postcode';
Where the parse_date operation first casts the string to DATE
and then the comparison is made.
Notice as well that I changed the wildcard selection to ga_sessions and then using the REGEX_EXTRACT
I consider only what comes after the "_" character. By doing so, you'll be able to select "intraday" tables as well.