I want to calculate the total timeOnSite for all visitors to a website (and divide it by 3600 because it's stored as seconds in the raw data), and then I want to break it down on content_group and a custom variable that is called content_level.
Problem arises because content_group and content_level are both nested in arrays, and timeOnSite is a totals.-stored variable that gets inflated if when used in a query that include and unnesting. (content_group is a normal hits.-nested variable, while content_level is nested in customDimensions that is nested in hits (a second level nested variable) (Will and Thomas C explain well why this problem emerges in this question Google Analytics Metrics are inflated when extracting hit level data using BigQuery , but I was unable to apply their advice to the totals.timeOnSite metric)
#StandardSQL
SELECT
date,
content_group,
content_level,
SUM(sessions) AS sessions,
SUM(sessions2) AS sessions2,
SUM(time_on_site) AS time_on_site
FROM (
SELECT
date AS date,
hits.contentGroup.contentGroup1 AS content_group,
(SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
SUM(totals.visits) AS sessions,
COUNT(DISTINCT CONCAT(cast(visitId AS STRING), fullVisitorId)) AS sessions2,
SUM(totals.timeOnSite)/3600 AS time_on_site
FROM `projectname.123456789.ga_sessions_20170101`,
unnest(hits) AS hits
GROUP BY
iso_date, content_group, content_level
ORDER BY
iso_date, content_group, content_level
)
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level
(I use a subquery because I'm planning on pulling data from several tables using UNION_ALL, but I omitted that syntax because I deemed it not relevant for the question.)
Questions:
*Is it possible to make "local unnestings" for both hits. and hits.customDimensions so that it would be possible to use totals.timeOnSite in my query without it being inflated?
*Is it possible to make a workaround for time on site like I've made with sessions and sessions2?
*Is there a third, hidden solution to this problem?
I couldn't fully test this one but it seems to be working against my dataset:
What I tried to do is first to avoid the
UNNEST(hits)
operation on the entire dataset. Therefore, in the very firstSELECT
statement,content_group
andcontent_level
are stored as ARRAYs.In the next
SELECT
, I unnested both of those ARRAYs and counted for the total sessions and the average time on site while grouping for the desired fields (I used the average here as it seems to make more sense when dealing with time on site but if you need the summation you can just change theAVG
toSUM
).You won't have the problem of repeated
timeOnSite
in this query because the outerUNNEST(hits)
was avoided. When theUNNEST(content_group)
andUNNEST(content_level)
happens, each value inside those ARRAYs gets associated only once to its correspondenttime_on_site
so no duplication is happening.It might seem odd that I'm answering my own question like this, but a contact of mine from outside of Stack Overflow helped me solve this, so it's actually his answer rather than mine.
The problem with session_duration can be solved by using a window function (you can read more about window functions in the BigQuery documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)
Both LEAD - OVER - PARTITION in the subselect and the subsubselect in the WHERE-clause are required for the window function to work properly.
A more accurate way of calculating sessions is also provided.