How to store the result of query on the current ta

2019-06-26 00:06发布

问题:

I have a structure

  {
    id: "123",
    scans:[{
       "scanid":"123",
       "status":"sleep"
      }]
  },
  {
    id: "123",
    scans:[{
       "scanid":"123",
       "status":"sleep"
      }]
  }

Query to remove duplicate:

      SELECT *
    FROM (
      SELECT
          *,
          ROW_NUMBER()
              OVER (PARTITION BY id)
              row_number,
      FROM table1
    )
    WHERE row_number = 1

I specified destination table as table1.

Here I have made scans as repeated records, scanid as string and status as string. But when I do some query (I am making a query to remove duplicate) and overwrite the existing table, the table schema is changed. It becomes scans_scanid(string) and scans_status(string). Scans record schema is changed now. Please suggest where am I going wrong?

回答1:

It is known that NEST() is not compatible with UnFlatten Results Output and mostly is used for intermediate result in subquery.

Try below workaround
Note, I use INTEGER for id and scanid. If they should be STRING you need to
a. make change in output schema section
as well as
b. remove use of parseInt() function in t = {scanid:parseInt(x[0]), status:x[1]}

SELECT id, scans.scanid, scans.status 
FROM JS(
  (      // input table
    SELECT id, NEST(CONCAT(STRING(scanid), ',', STRING(status))) AS scans
    FROM (
      SELECT id, scans.scanid, scans.status 
      FROM (
        SELECT id, scans.scanid, scans.status, 
               ROW_NUMBER() OVER (PARTITION BY id) AS dup
        FROM table1
      ) WHERE dup = 1  
    ) GROUP BY id
  ),
  id, scans,     // input columns
  "[{'name': 'id', 'type': 'INTEGER'},    // output schema
    {'name': 'scans', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'scanid', 'type': 'INTEGER'},
       {'name': 'status', 'type': 'STRING'}
     ]    
    }
  ]",
  "function(row, emit){    // function 
    var c = [];
    for (var i = 0; i < row.scans.length; i++) {
      x = row.scans[i].toString().split(',');
      t = {scanid:parseInt(x[0]), status:x[1]}
      c.push(t);
    };
    emit({id: row.id, scans: c});  
  }"
)

Here I use BigQuery User-Defined Functions. They are extremely powerful yet still have some Limits and Limitations to be aware of. Also have in mind - they are quite a candidates for being qualified as expensive High-Compute queries

Complex queries can consume extraordinarily large computing resources relative to the number of bytes processed. Typically, such queries contain a very large number of JOIN or CROSS JOIN clauses or complex User-defined Functions.



回答2:

1) If you run the query on the web UI, the result is automatically flattened, so that's why you see the schema is changed.

You need to run your query and write to a destination table, you have options on the web UI also to do this.

2) If you don't run your query on the web UI but still see schema changed, you should make explicit selects so the schema is retained for you eg:

select 'foo' as scans.scanid

This creates for you a record like output, but it won't be a repeated record for that please read further.

3) For some use cases you may need to use the NEST(expr) function which

Aggregates all values in the current aggregation scope into a repeated field. For example, the query "SELECT x, NEST(y) FROM ... GROUP BY x" returns one output record for each distinct x value, and contains a repeated field for all y values paired with x in the query input. The NEST function requires a GROUP BY clause.

BigQuery automatically flattens query results, so if you use the NEST function on the top level query, the results won't contain repeated fields. Use the NEST function when using a subselect that produces intermediate results for immediate use by the same query.