I have a structure
{
id: "123",
scans:[{
"scanid":"123",
"status":"sleep"
}]
},
{
id: "123",
scans:[{
"scanid":"123",
"status":"sleep"
}]
}
Query to remove duplicate:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY id)
row_number,
FROM table1
)
WHERE row_number = 1
I specified destination table as table1.
Here I have made scans as repeated records, scanid as string and status as string. But when I do some query (I am making a query to remove duplicate) and overwrite the existing table, the table schema is changed. It becomes scans_scanid(string)
and scans_status(string)
. Scans record schema is changed now. Please suggest where am I going wrong?
It is known that NEST() is not compatible with UnFlatten Results Output and mostly is used for intermediate result in subquery.
Try below workaround
Note, I use INTEGER for id and scanid. If they should be STRING you need to
a. make change in output schema section
as well as
b. remove use of parseInt() function in t = {scanid:parseInt(x[0]), status:x[1]}
SELECT id, scans.scanid, scans.status
FROM JS(
( // input table
SELECT id, NEST(CONCAT(STRING(scanid), ',', STRING(status))) AS scans
FROM (
SELECT id, scans.scanid, scans.status
FROM (
SELECT id, scans.scanid, scans.status,
ROW_NUMBER() OVER (PARTITION BY id) AS dup
FROM table1
) WHERE dup = 1
) GROUP BY id
),
id, scans, // input columns
"[{'name': 'id', 'type': 'INTEGER'}, // output schema
{'name': 'scans', 'type': 'RECORD',
'mode': 'REPEATED',
'fields': [
{'name': 'scanid', 'type': 'INTEGER'},
{'name': 'status', 'type': 'STRING'}
]
}
]",
"function(row, emit){ // function
var c = [];
for (var i = 0; i < row.scans.length; i++) {
x = row.scans[i].toString().split(',');
t = {scanid:parseInt(x[0]), status:x[1]}
c.push(t);
};
emit({id: row.id, scans: c});
}"
)
Here I use BigQuery User-Defined Functions. They are extremely powerful yet still have some Limits and Limitations to be aware of. Also have in mind - they are quite a candidates for being qualified as expensive High-Compute queries
Complex queries can consume extraordinarily large computing resources
relative to the number of bytes processed. Typically, such queries
contain a very large number of JOIN or CROSS JOIN clauses or complex
User-defined Functions.
1) If you run the query on the web UI, the result is automatically flattened, so that's why you see the schema is changed.
You need to run your query and write to a destination table, you have options on the web UI also to do this.
2) If you don't run your query on the web UI but still see schema changed, you should make explicit selects so the schema is retained for you eg:
select 'foo' as scans.scanid
This creates for you a record like output, but it won't be a repeated record for that please read further.
3) For some use cases you may need to use the NEST(expr)
function which
Aggregates all values in the current aggregation scope into a repeated
field. For example, the query "SELECT x, NEST(y) FROM ... GROUP BY x"
returns one output record for each distinct x value, and contains a
repeated field for all y values paired with x in the query input. The
NEST function requires a GROUP BY clause.
BigQuery automatically flattens query results, so if you use the NEST
function on the top level query, the results won't contain repeated
fields. Use the NEST function when using a subselect that produces
intermediate results for immediate use by the same query.