Google Analytics Metrics are inflated when extract

2020-02-16 02:21发布

问题:

I'm trying to display the source property name within the Google Analytics roll up property I have linked to bigquery. Issue is, is that when I try the below some of the metrics become hugely inflated. I'm guessing this is to do with repeated fields but not sure what to do about it. I've tried a number of workarounds like using "max" but this doesn't display every property name.

All metrics except users and visits seem to be inflated.

SELECT
  date,
  MAX(CASE
      WHEN EXISTS(  SELECT 1  FROM UNNEST(hits) hits  WHERE REGEXP_CONTAINS(hits.sourcePropertyInfo.sourcePropertyTrackingId, r'82272640')) THEN 'MUG'
      WHEN EXISTS (
    SELECT
      1
    FROM
      UNNEST(hits) hits
    WHERE
      hits.sourcePropertyInfo.sourcePropertyTrackingId = 'Social') THEN 'Social'ELSE 'Website' END) AS Property,
  geoNetwork.country AS Country,
 COUNT(DISTINCT CONCAT(cast(visitId AS STRING),fullVisitorId)) as visits,
 sum(totals.visits) as visits2,
  COUNT(DISTINCT(fullVisitorId)) AS Users,
 h.sourcePropertyInfo.sourcePropertyDisplayName as display,
  SUM((
    SELECT
      SUM(latencyTracking.pageLoadTime)
    FROM
      UNNEST(hits)
    WHERE
      page.pagePath = '/' ))/SUM((
    SELECT
      SUM(latencyTracking.pageLoadSample)
    FROM
      UNNEST(hits)
    WHERE
      page.pagePath = '/')) AS pageloadspeed,
  SUM(totals.newVisits) AS new_,
  SUM(totals.screenviews) AS PAGEVIEWS,
  SUM(totals.bounces) AS BOUNCES,
   sum(CASE
      WHEN device.isMobile = TRUE THEN (totals.visits)
      ELSE 0 END) mobilevisits,
  SUM(CASE
      WHEN trafficSource.medium = 'organic' THEN (totals.visits)
      ELSE 0 END) organicvisits,
  SUM(CASE
      WHEN EXISTS(  SELECT 1  FROM UNNEST(hits) hits  WHERE REGEXP_CONTAINS(hits.eventInfo.eventAction,'register$|registersuccess|new registration|account signup|registro')) THEN 1
      ELSE 0 END) AS NewRegistrations,
  SUM(CASE
      WHEN EXISTS(  SELECT 1  FROM UNNEST(hits) hits  WHERE REGEXP_CONTAINS(hits.eventInfo.eventAction, 'add to cart|add to bag|click to buy|ass to basket|comprar|addtobasket::')) THEN 1
      ELSE 0 END) AS ClickToBuy,
  SUM(totals.transactions) AS Transactions
FROM
  `project.dataset.ga_sessions_*`, UNNEST(hits) as h
WHERE
  1 = 1
  AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-05-01')
  AND TIMESTAMP('2017-05-01')
GROUP BY
  date,
  Country,
  display
ORDER BY
  visits DESC;

EDIT:

I've tried simply removing UNNEST(HITS) hits as h from the FROM clause, this gave me the following error:

Error: Cannot access field sourcePropertyInfo on a value with type ARRAY> at [16:14]

I've also tried using it in a subquery as follows:

(select h.sourcePropertyInfo.sourcePropertyDisplayName from unnest(hits) h) as displayname, 

and get the error:

Scalar subquery produced more than one element

回答1:

As you need to compute several values on the hits level maybe unnesting the field hits is the best approach. The downside is that you lose the totals field aggregation for the session level but still you can work it around.

As an example:

SELECT
  date,
  CASE
    WHEN REGEXP_CONTAINS(h.sourcePropertyInfo.sourcePropertyTrackingId, r'82272640') THEN 'MUG'
    WHEN h.sourcePropertyInfo.sourcePropertyTrackingId = 'Social' THEN 'Social'ELSE 'Website'
  END AS Property,
  geoNetwork.country AS Country,
  COUNT(DISTINCT CONCAT(CAST(visitId AS STRING),fullVisitorId)) AS visits,
  COUNT(DISTINCT(fullVisitorId)) AS Users,
  h.sourcePropertyInfo.sourcePropertyDisplayName AS display,
  SUM(CASE
      WHEN REGEXP_CONTAINS(h.page.pagepath, r'/') THEN h.latencyTracking.pageLoadTime END) / SUM(CASE
      WHEN REGEXP_CONTAINS(h.page.pagepath, r'/') THEN h.latencyTracking.pageLoadSample END) AS pageloadspeed,
  COUNT(DISTINCT
    CASE
      WHEN totals.newVisits = 1 THEN CONCAT(CAST(visitId AS STRING),fullVisitorId) END) new_visits,
  COUNT(CASE
         WHEN h.type = 'PAGE' THEN h.page.pagepath END) pageviews,
  SUM(CASE
       WHEN (h.isentrance = TRUE AND h.isexit = TRUE) THEN 1 END) bounces,
  COUNT(DISTINCT (CASE
        WHEN device.isMobile = TRUE THEN CONCAT(CAST(visitId AS STRING),fullVisitorId) END)) mobilevisits,
  COUNT(DISTINCT (CASE
        WHEN trafficSource.medium = 'organic' THEN CONCAT(CAST(visitId AS STRING),fullVisitorId) END)) organicvisits,
  SUM(CASE
       WHEN REGEXP_CONTAINS(h.eventInfo.eventAction,'register$|registersuccess|new registration|account signup|registro') THEN 1 END) AS NewRegistrations,
  SUM(CASE
       WHEN REGEXP_CONTAINS(h.eventInfo.eventAction, 'add to cart|add to bag|click to buy|ass to basket|comprar|addtobasket::') THEN 1 END) AS ClickToBuy,
  COUNT(h.transaction.transactionid) transactions
FROM
  `project_id.dataset_id.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE
  1 = 1
  AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-05-01') AND TIMESTAMP('2017-05-01')
GROUP BY
  date,
  Country,
  display,
  Property

I ran it against our dataset and it seems to be working. Some changes I did:

  • Removed the MAX operation for the Property and added it to the group by.
  • pageviews was considered as the count of hits where hit.type = 'PAGE'. Not sure if this is the same for screenviews though.
  • bounce is computed when there's an entrance and exit event.
  • Total transactions is a count on transaction ids (hopefully this field is being filled in your dataset as well).


回答2:

You're flattening your table in your most outer FROM statement (ie here:

FROM project.dataset.ga_sessions_*, UNNEST(hits) as h)

All your session level dimensions eg device.* or totals.* values like totals.transactions are already rolled up to a session level, so when you flatten the table by unnesting the hits, these totals values get written as many times as there are hits. Example: Let's say there are 30 hits in one session and 2 transactions, since you flatten/unnest your hits, you will be left with 30 rows containing totals.transactions = 2, so when you sum over them, the result would be that there were 60 transactions in this session. Your users and visits don't get affected by this since you count distinct them, so any dupes get eliminated.

It looks to me as if you're query will work if you just remove the , UNNEST(hits) as h if you remove or adapt this line

h.sourcePropertyInfo.sourcePropertyDisplayName as display

since you already unnest the hits within the select statement where required apart from in this particular line.



回答3:

after UNNESTing an array, use MAX() instead SUMMing them up to report on session level matrices:

use:

  • MAX(totals.screenviews) AS PAGEVIEWS,
  • MAX(totals.bounces) AS BOUNCES,
  • MAX(totals.transactions) AS Transactions
  • ...
  • ...

instead of:

  • SUM(totals.screenviews) AS PAGEVIEWS,
  • SUM(totals.bounces) AS BOUNCES,
  • SUM(totals.transactions) AS Transactions

this should partially solve your issue. Let me know how it goes?



回答4:

I think the reason why the calculated bounce rate is much higher in the query from William Fuks is the following

WHEN (h.isentrance = TRUE AND h.isexit = TRUE) THEN 1 END) bounces

It seems that isEntrance and isExit only happens on PAGE hits and so does not account for events. Hence the over count of bounces is due to single page views that might have had one or more interaction events happen on the page.