Unnest and totals.timeOnSite (BigQuery and Google

2020-05-01 09:08发布

I want to calculate the total timeOnSite for all visitors to a website (and divide it by 3600 because it's stored as seconds in the raw data), and then I want to break it down on content_group and a custom variable that is called content_level.

Problem arises because content_group and content_level are both nested in arrays, and timeOnSite is a totals.-stored variable that gets inflated if when used in a query that include and unnesting. (content_group is a normal hits.-nested variable, while content_level is nested in customDimensions that is nested in hits (a second level nested variable) (Will and Thomas C explain well why this problem emerges in this question Google Analytics Metrics are inflated when extracting hit level data using BigQuery , but I was unable to apply their advice to the totals.timeOnSite metric)

#StandardSQL
SELECT   
 date,   
 content_group,   
 content_level,  
 SUM(sessions) AS sessions,   
 SUM(sessions2) AS sessions2, 
 SUM(time_on_site) AS time_on_site   
FROM (   
     SELECT   
       date AS date,   
       hits.contentGroup.contentGroup1 AS content_group,   
       (SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,   
       SUM(totals.visits) AS sessions,   
       COUNT(DISTINCT CONCAT(cast(visitId AS STRING), fullVisitorId)) AS sessions2,   
       SUM(totals.timeOnSite)/3600 AS time_on_site   
     FROM `projectname.123456789.ga_sessions_20170101`,   
       unnest(hits) AS hits
     GROUP BY   
       iso_date, content_group, content_level
     ORDER BY 
       iso_date, content_group, content_level
    )   
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level 

(I use a subquery because I'm planning on pulling data from several tables using UNION_ALL, but I omitted that syntax because I deemed it not relevant for the question.)

Questions:

*Is it possible to make "local unnestings" for both hits. and hits.customDimensions so that it would be possible to use totals.timeOnSite in my query without it being inflated?

*Is it possible to make a workaround for time on site like I've made with sessions and sessions2?

*Is there a third, hidden solution to this problem?

2条回答
Rolldiameter
2楼-- · 2020-05-01 09:57

I couldn't fully test this one but it seems to be working against my dataset:

SELECT
  DATE,
  COUNT(DISTINCT CONCAT(fv, CAST(v AS STRING))) sessions,
  AVG(tos) avg_time_on_site,
  content_group,
  content_level
FROM(
  SELECT   
   date AS date,   
   fullvisitorid fv,
   visitid v,
   ARRAY(SELECT DISTINCT contentGroup.contentGroup1 FROM UNNEST(hits)) AS content_group,   
   ARRAY(SELECT DISTINCT value FROM UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS custd WHERE index = 51) AS content_level,   
   totals.timeOnSite / 3600 AS tos 
  FROM `dataset_id.ga_sessions_20170101`
  WHERE totals.timeOnSite IS NOT NULL
  )
CROSS JOIN UNNEST(content_group) content_group
LEFT JOIN UNNEST(content_level) content_level
GROUP BY
  DATE, content_group, content_level

What I tried to do is first to avoid the UNNEST(hits) operation on the entire dataset. Therefore, in the very first SELECT statement, content_group and content_level are stored as ARRAYs.

In the next SELECT, I unnested both of those ARRAYs and counted for the total sessions and the average time on site while grouping for the desired fields (I used the average here as it seems to make more sense when dealing with time on site but if you need the summation you can just change the AVG to SUM).

You won't have the problem of repeated timeOnSite in this query because the outer UNNEST(hits) was avoided. When the UNNEST(content_group) and UNNEST(content_level) happens, each value inside those ARRAYs gets associated only once to its correspondent time_on_site so no duplication is happening.

查看更多
霸刀☆藐视天下
3楼-- · 2020-05-01 10:07

It might seem odd that I'm answering my own question like this, but a contact of mine from outside of Stack Overflow helped me solve this, so it's actually his answer rather than mine.

The problem with session_duration can be solved by using a window function (you can read more about window functions in the BigQuery documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)

#StandardSQL
SELECT   
 iso_date,   
 content_group,   
 content_level,  
 COUNT(DISTINCT SessionId) AS sessions, 
 SUM(session_duration) AS session_duration 
FROM (   
     SELECT   
       date AS iso_date,   
       hits.contentGroup.contentGroup1 AS content_group,   
       (SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,  
       CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS SessionId, 
       (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) - hits.time) / 3600000 AS session_duration 
     FROM `projectname.123456789.ga_sessions_20170101`,   
       unnest(hits) AS hits
     WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170131" 
       AND (SELECT 
              MAX(IF(index=51, value, NULL)) 
            FROM 
              UNNEST(hits.customDimensions) 
            WHERE 
              value IN ("web", "phone", "tablet")
            ) IS NOT NULL 
     GROUP BY   
       iso_date, content_group, content_level
     ORDER BY 
       iso_date, content_group, content_level
    )   
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level 

Both LEAD - OVER - PARTITION in the subselect and the subsubselect in the WHERE-clause are required for the window function to work properly.

A more accurate way of calculating sessions is also provided.

查看更多
登录 后发表回答