BigQuery Pivot Data Rows Columns

2019-05-17 10:16发布

问题:

Im currently processing data in BigQuery then export into Excel to do the final Pivot table and was hoping to be able to create the same with the PIVOT option in BigQuery.

My Data set in big query looks like

Transaction_Month || ConsumerId || CUST_createdMonth
01/01/2015        || 1          || 01/01/2015
01/01/2015        || 1          || 01/01/2015
01/02/2015        || 1          || 01/01/2015
01/01/2015        || 2          || 01/01/2015
01/02/2015        || 3          || 01/02/2015
01/02/2015        || 4          || 01/02/2015
01/02/2015        || 5          || 01/02/2015
01/03/2015        || 5          || 01/02/2015
01/03/2015        || 6          || 01/03/2015
01/04/2015        || 6          || 01/03/2015
01/06/2015        || 6          || 01/03/2015
01/03/2015        || 7          || 01/03/2015
01/04/2015        || 8          || 01/04/2015
01/05/2015        || 8          || 01/04/2015
01/04/2015        || 9          || 01/04/2015

It is essentially an order table with customer information appended.

When i put this data into excel I add it to a pivot table, I add the CUST_createdMonth as a Row, Transaction_Month as a column and the value is a distinct Count of the ConsumerID

The output looks as follows

Is this sort of pivot possible in BigQuery?

回答1:

There is no nice way of doing this in BigQuery, but you can do it follow below idea

Step 1

Run below query

SELECT 'SELECT CUST_createdMonth, ' + 
   GROUP_CONCAT_UNQUOTED(
      'EXACT_COUNT_DISTINCT(IF(Transaction_Month = "' + Transaction_Month + '", ConsumerId, NULL)) as [m_' + REPLACE(Transaction_Month, '/', '_') + ']'
   ) 
   + ' FROM yourTable GROUP BY CUST_createdMonth ORDER BY CUST_createdMonth'
FROM (
  SELECT Transaction_Month 
  FROM yourTable
  GROUP BY Transaction_Month
  ORDER BY Transaction_Month
) 

As a result - you will get string like below (it is formatted below for readability sake)

SELECT
  CUST_createdMonth,
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/01/2015", ConsumerId, NULL)) AS [m_01_01_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/02/2015", ConsumerId, NULL)) AS [m_01_02_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/03/2015", ConsumerId, NULL)) AS [m_01_03_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/04/2015", ConsumerId, NULL)) AS [m_01_04_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/05/2015", ConsumerId, NULL)) AS [m_01_05_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/06/2015", ConsumerId, NULL)) AS [m_01_06_2015]
  FROM yourTable 
GROUP BY
  CUST_createdMonth
ORDER BY
  CUST_createdMonth

Step 2

Just run above composed query

Result will be lik e below

CUST_createdMonth   m_01_01_2015    m_01_02_2015    m_01_03_2015    m_01_04_2015    m_01_05_2015    m_01_06_2015     
01/01/2015          2               1               0               0               0               0    
01/02/2015          0               3               1               0               0               0    
01/03/2015          0               0               2               1               0               1    
01/04/2015          0               0               0               2               1               0   

Note

Step 1 is helpful if you have many months to pivot so too much of manual work.
In this case - Step 1 helps you to generate your query

You can see more about pivoting in my other posts.

How to scale Pivoting in BigQuery?
Please note – there is a limitation of 10K columns per table - so you are limited with 10K organizations.
You can also see below as simplified examples (if above one is too complex/verbose):
How to transpose rows to columns with large amount of the data in BigQuery/SQL?
How to create dummy variable columns for thousands of categories in Google BigQuery?
Pivot Repeated fields in BigQuery



回答2:

Actually Mikhail there is another way to transpose rows of an EAV type schema into columns, by using logging tables and querying the last CREATE TABLE entry to determine the latest table schema.

     CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema String)
              RETURNS ARRAY<STRING> AS ((
                SELECT
                  SPLIT(
                    REGEXP_REPLACE(REPLACE(LTRIM(jsonSchema,'{ '),'"fields": [',''), r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*', '\\1,')
                  ,',')
              ));
        WITH valid_schema_columns AS (
          WITH array_output aS (SELECT
            jsonSchemaStringToArray(jsonSchema) AS column_names
          FROM (
            SELECT
              protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema
              , ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
            FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
            WHERE
              protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
              AND
              protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
              AND
              protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
          ) AS t
          WHERE
            t.record_count = 1 -- grab the latest entry
          )
          -- this is actually what UNNESTS the array into standard rows
          SELECT
            valid_column_name
          FROM array_output
          LEFT JOIN UNNEST(column_names) AS valid_column_name

        )