Flattening Google Analytics data (with repeated fi

2019-01-24 04:58发布

We have a premium Google Analytics account, which will give us access to row level event data. This data is exported daily to Google Bigquery and for every day a new table will be created in a data set.

Until a week ago we were able to export this Google Analytics data to CSV by flattening it to a temp staging table and then export it to CSV. The query we used to do so was like this:

SELECT * FROM 
    flatten([xxxxxxxx.ga_sessions_20140829],hits),
    flatten([xxxxxxxx.ga_sessions_20140828],hits),
    flatten([xxxxxxxx.ga_sessions_20140827],hits),
    flatten([xxxxxxxx.ga_sessions_20140826],hits)

Yesterday I noticed this query will now throw an error:

Cannot output multiple independently repeated fields at the same time. Found customDimensions_value and hits_product_productSKU

Apparently something has changed regarding the flatten() function, as hits_product_productSKU is child of the hits field.

I also tried some old queries which were in the query history, but they're broken as well. There is no release note mentioning any change, so what is happening?

How can I export everything in the Google Analytics BigQuery export files again?

5条回答
我命由我不由天
2楼-- · 2019-01-24 05:17

If you're using the BigQuery Web Console, select a Destination Table, click Allow Large Results and uncheck Flatten Results.

If you're using bq command-line tool:

bq query --allow_large_results --noflatten --destination_table NAME_OF_TABLE  "SELECT * from FLATTEN( [dataset], hits)"
查看更多
放我归山
3楼-- · 2019-01-24 05:18

I suspect the tables generated by Google Analytics Premium have been augmented with the addition of new columns. Adding columns shouldn't be a problem, unless your queries use the * selector, and one of the new columns happens to contain nested values.

Recommended solution: Instead of using *, explicitly ask for the columns you need.

查看更多
Summer. ? 凉城
4楼-- · 2019-01-24 05:18

You have to use flatten the query before connecting. Simply pick the fields that are mentioned in the error and use the nested flattening:

SELECT * FROM 
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140829],hits),customDimensions_value),hits_product_productSKU),
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140828],hits),customDimensions_value),hits_product_productSKU),
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140827],hits),customDimensions_value),hits_product_productSKU),    flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140826],hits),customDimensions_value),hits_product_productSKU)
查看更多
Emotional °昔
5楼-- · 2019-01-24 05:28

This is actually the result of a bugfix I submitted last week, and is preventing you from getting incorrect results.

BigQuery by default flattens all query results before returning them, but we only want to flatten one independently repeated field to avoid a cross-product expansion of data. The bug was that our checks for multiple repeated fields failed to take into account repeatedness of parent records in some cases, which caused us to fail to flatten some independently repeated fields. This meant that we could return flat rows where independently repeated values were actually "flattened" into dependently repeated values, instead of generating the cross-product, which is actually a wrong result.

What you're seeing here is a result of the stricter check: you have (at least) two repeated fields in your output schema before we attempt to flatten the results.

Another important thing to note is that the FLATTEN([table-value], [field]) function only flattens the repeatedness of the field you specify as the second argument. When you say flatten([xxxxxxxx.ga_sessions_20140829],hits), you are flattening only the "hits" record. If you also want to flatten its repeated children (product, promotion, etc.) you must explicitly add another flatten for those fields, like:

FLATTEN(FLATTEN([xxxxxxxx.ga_sessions_20140829],hits),hits.product)

--

You have a couple options to make your example work:

1) Select fewer fields. If you only care about getting flattened output of a few fields, you can remove the independently repeated fields from your query results by explicitly selecting only the fields you care about.

2) Add more FLATTENs. You'll need to flatten on each repeated field, which looks to include at least hits, hits.product, and customDimensions. You may find that the error message will then complain about different repeated fields: add more FLATTENs on the repeated fields in your schema until it works.

查看更多
Rolldiameter
6楼-- · 2019-01-24 05:37

For those who want to import GA data to a relational database: Please be aware that a nested schema is actually putting multiple relational tables into one nested structure - they are equivalent, so plain flattening is not the best solution here: you want to separate the tables again! That way you save a lot of space in storage and you queries might get faster.

You can think of GA data being 3 relational tables:

  • sessions table
  • hits table
  • products table

To get sessions table you can use something like

-- session table 
SELECT
  * EXCEPT(hits,
    customDimensions),
  CONCAT('{',(
    SELECT
      STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
    FROM
      t.customdimensions),'}') as customDimensions
FROM
  `project.dataset.ga_sessions_20171031` AS t

customDimensions is aggregated to a json string.

Similarly with the hits table:

-- hits table
SELECT
 fullvisitorid,
 visitid,
 visitstarttime,
 h.* EXCEPT(product,
 customdimensions, customMetrics, customVariables, promotion, experiment),
 CONCAT('{',(
   SELECT
     STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
   FROM
     h.customdimensions),'}') AS hitsCustomDimensions,
 CONCAT('{',(
   SELECT STRING_AGG(CONCAT(CAST(index AS string),':',cast(value as string)) )
   FROM h.custommetrics),'}') AS hitsCustomMetrics
FROM
 `project.dataset.ga_sessions_20171031` AS t,
 t.hits AS h

You can add promotion and experiment as I did with customDimensions using a sub-select.

Join sessions with hits using a concatenation of fullvisitorid + visitstarttime/visitid (GA sessions include midnight split: use visitStartTime instead of visitid! If you want to ignore midnight split use visitid - it stays the same despite the split)

For products you'd add the hit number to your 'session id' to get a unique identifier:

-- product table
SELECT
  fullvisitorid,
  visitid,
  visitstarttime,
  h.hitNumber,
  p.* EXCEPT(customdimensions, customMetrics),
  CONCAT('{',(
    SELECT STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
    FROM p.customdimensions),'}') AS productsCustomDimensions,
  CONCAT('{',(
    SELECT STRING_AGG(CONCAT(CAST(index AS string),':',cast(value as string)) )
    FROM p.custommetrics),'}') AS productsCustomMetrics
FROM
  `project.dataset.ga_sessions_20171031` AS t,
  t.hits AS h, h.product as p

Now you have 3 relational / "flat" tables you can join as needed in any relational db.

查看更多
登录 后发表回答