We have a premium Google Analytics account, which will give us access to row level event data. This data is exported daily to Google Bigquery and for every day a new table will be created in a data set.
Until a week ago we were able to export this Google Analytics data to CSV by flattening it to a temp staging table and then export it to CSV. The query we used to do so was like this:
SELECT * FROM
flatten([xxxxxxxx.ga_sessions_20140829],hits),
flatten([xxxxxxxx.ga_sessions_20140828],hits),
flatten([xxxxxxxx.ga_sessions_20140827],hits),
flatten([xxxxxxxx.ga_sessions_20140826],hits)
Yesterday I noticed this query will now throw an error:
Cannot output multiple independently repeated fields at the same time. Found customDimensions_value and hits_product_productSKU
Apparently something has changed regarding the flatten() function, as hits_product_productSKU is child of the hits field.
I also tried some old queries which were in the query history, but they're broken as well.
There is no release note mentioning any change, so what is happening?
How can I export everything in the Google Analytics BigQuery export files again?
This is actually the result of a bugfix I submitted last week, and is preventing you from getting incorrect results.
BigQuery by default flattens all query results before returning them, but we only want to flatten one independently repeated field to avoid a cross-product expansion of data. The bug was that our checks for multiple repeated fields failed to take into account repeatedness of parent records in some cases, which caused us to fail to flatten some independently repeated fields. This meant that we could return flat rows where independently repeated values were actually "flattened" into dependently repeated values, instead of generating the cross-product, which is actually a wrong result.
What you're seeing here is a result of the stricter check: you have (at least) two repeated fields in your output schema before we attempt to flatten the results.
Another important thing to note is that the FLATTEN([table-value], [field]) function only flattens the repeatedness of the field you specify as the second argument. When you say flatten([xxxxxxxx.ga_sessions_20140829],hits), you are flattening only the "hits" record. If you also want to flatten its repeated children (product, promotion, etc.) you must explicitly add another flatten for those fields, like:
FLATTEN(FLATTEN([xxxxxxxx.ga_sessions_20140829],hits),hits.product)
--
You have a couple options to make your example work:
1) Select fewer fields. If you only care about getting flattened output of a few fields, you can remove the independently repeated fields from your query results by explicitly selecting only the fields you care about.
2) Add more FLATTENs. You'll need to flatten on each repeated field, which looks to include at least hits, hits.product, and customDimensions. You may find that the error message will then complain about different repeated fields: add more FLATTENs on the repeated fields in your schema until it works.
If you're using the BigQuery Web Console, select a Destination Table, click Allow Large Results and uncheck Flatten Results.
If you're using bq command-line tool:
bq query --allow_large_results --noflatten --destination_table NAME_OF_TABLE "SELECT * from FLATTEN( [dataset], hits)"
I suspect the tables generated by Google Analytics Premium have been augmented with the addition of new columns. Adding columns shouldn't be a problem, unless your queries use the * selector, and one of the new columns happens to contain nested values.
Recommended solution: Instead of using *, explicitly ask for the columns you need.
standardsql
For those who want to import GA data to a relational database: Please be aware that a nested schema is actually putting multiple relational tables into one nested structure - they are equivalent, so plain flattening is not the best solution here: you want to separate the tables again!
That way you save a lot of space in storage and you queries might get faster.
You can think of GA data being 3 relational tables:
- sessions table
- hits table
- products table
To get sessions table you can use something like
-- session table
SELECT
* EXCEPT(hits,
customDimensions),
CONCAT('{',(
SELECT
STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
FROM
t.customdimensions),'}') as customDimensions
FROM
`project.dataset.ga_sessions_20171031` AS t
customDimensions is aggregated to a json string.
Similarly with the hits table:
-- hits table
SELECT
fullvisitorid,
visitid,
visitstarttime,
h.* EXCEPT(product,
customdimensions, customMetrics, customVariables, promotion, experiment),
CONCAT('{',(
SELECT
STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
FROM
h.customdimensions),'}') AS hitsCustomDimensions,
CONCAT('{',(
SELECT STRING_AGG(CONCAT(CAST(index AS string),':',cast(value as string)) )
FROM h.custommetrics),'}') AS hitsCustomMetrics
FROM
`project.dataset.ga_sessions_20171031` AS t,
t.hits AS h
You can add promotion and experiment as I did with customDimensions using a sub-select.
Join sessions with hits using a concatenation of fullvisitorid + visitstarttime/visitid (GA sessions include midnight split: use visitStartTime instead of visitid! If you want to ignore midnight split use visitid - it stays the same despite the split)
For products you'd add the hit number to your 'session id' to get a unique identifier:
-- product table
SELECT
fullvisitorid,
visitid,
visitstarttime,
h.hitNumber,
p.* EXCEPT(customdimensions, customMetrics),
CONCAT('{',(
SELECT STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
FROM p.customdimensions),'}') AS productsCustomDimensions,
CONCAT('{',(
SELECT STRING_AGG(CONCAT(CAST(index AS string),':',cast(value as string)) )
FROM p.custommetrics),'}') AS productsCustomMetrics
FROM
`project.dataset.ga_sessions_20171031` AS t,
t.hits AS h, h.product as p
Now you have 3 relational / "flat" tables you can join as needed in any relational db.
You have to use flatten the query before connecting. Simply pick the fields that are mentioned in the error and use the nested flattening:
SELECT * FROM
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140829],hits),customDimensions_value),hits_product_productSKU),
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140828],hits),customDimensions_value),hits_product_productSKU),
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140827],hits),customDimensions_value),hits_product_productSKU), flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140826],hits),customDimensions_value),hits_product_productSKU)