I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases, until it hits a ResourcesExceeded error.
Here are a few example queries:
Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but I was wondering if there are workarounds to achieve the above. If not, I would greatly appreciate it if someone could suggest an alternative approach (such as Kinetica, Cassandra, or anything else).
Why not export the resulting table into GCS?
It will automatically split tables into files if you use wildcards, and this export only has to be done one time, instead of querying every single time and paying for all the processing.
Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.
Offset in systems like BigQuery works by reading and discarding all results until the offset.
You'll need to use a column as a lower limit so the engine can start directly from that part of the key range; it can't efficiently seek to an arbitrary point midway through a query.
For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:
If you did this via OFFSET 100000, it takes 4s and the first row is:
If instead of offset, I had used those date and rate values, the query only takes 2.9s:
So what does this mean? Rather than allowing the user to specify result number ranges (e.g., show me rows starting at 100000), have them specify it in a more natural form (e.g., show rides that started on January 6th, 2015).
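To make the difference concrete, here is a minimal sketch of the "lower limit" idea (often called keyset pagination). It uses an in-memory SQLite table as a stand-in for BigQuery, so it says nothing about BigQuery timings; the table name trips, its columns, and the helper names page_by_offset and page_after are all hypothetical. It only shows that filtering on the last seen sort key returns the same page as OFFSET would.

```python
import sqlite3

# Hypothetical miniature stand-in for a trips table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (rate_code INTEGER, pickup TEXT, dropoff TEXT)")
rows = [
    (rate, f"2015-01-{day:02d} {hour:02d}:00:00", f"2015-01-{day:02d} {hour:02d}:30:00")
    for rate in (1, 2)
    for day in range(1, 11)
    for hour in range(24)
]
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)

ORDER = "ORDER BY rate_code, pickup, dropoff"


def page_by_offset(offset, limit):
    # The engine has to produce and discard `offset` rows before returning any.
    sql = f"SELECT rate_code, pickup, dropoff FROM trips {ORDER} LIMIT ? OFFSET ?"
    return conn.execute(sql, (limit, offset)).fetchall()


def page_after(last_row, limit):
    # Keyset variant: start strictly after the last row the caller has seen.
    # The predicate is the expanded form of (rate_code, pickup, dropoff) > last_row.
    rate, pickup, dropoff = last_row
    sql = (
        "SELECT rate_code, pickup, dropoff FROM trips "
        "WHERE rate_code > ? "
        "OR (rate_code = ? AND (pickup > ? OR (pickup = ? AND dropoff > ?))) "
        f"{ORDER} LIMIT ?"
    )
    return conn.execute(sql, (rate, rate, pickup, pickup, dropoff, limit)).fetchall()


first_page = page_by_offset(0, 100)
next_page = page_after(first_page[-1], 50)  # same rows as page_by_offset(100, 50)
```

The keyset version needs the sort key to be unique (or to include a tie-breaking column), otherwise rows that share the last seen key can be skipped.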
If you want to get fancy and REALLY need to allow the user to specify actual row numbers, you can make it a lot more efficient by calculating row ranges in advance: query everything once and remember which row number falls at the start of each hour of the year (8760 values), or even each minute (525600 values). You can then use this to pick an efficient starting point. Do a look-up for the closest hour/minute for a given row range (e.g., in Cloud Datastore), then convert the user's query into the more efficient version above.
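The look-up step could be sketched roughly as follows. This is only an illustration under assumptions: the checkpoints are held in plain Python lists rather than Cloud Datastore, row numbers are 0-based, and build_checkpoints and locate are hypothetical helper names.

```python
from bisect import bisect_right


def build_checkpoints(sorted_keys, bucket_of):
    """Record the global row number at which each bucket (e.g. an hour) starts."""
    starts, buckets = [], []
    for row_number, key in enumerate(sorted_keys):
        bucket = bucket_of(key)
        if not buckets or buckets[-1] != bucket:
            starts.append(row_number)
            buckets.append(bucket)
    return starts, buckets


def locate(starts, buckets, row_number):
    """Return (bucket to start scanning from, rows still to skip inside it)."""
    i = bisect_right(starts, row_number) - 1  # last checkpoint at or before the row
    return buckets[i], row_number - starts[i]


# Hypothetical sorted pickup timestamps: three trips per hour over three days.
keys = sorted(
    f"2015-01-{d:02d} {h:02d}:{m:02d}:00"
    for d in range(1, 4)
    for h in range(24)
    for m in (0, 20, 40)
)
hour_of = lambda ts: ts[:13]  # "YYYY-MM-DD HH"
starts, buckets = build_checkpoints(keys, hour_of)
bucket, skip = locate(starts, buckets, 100)
```

The returned bucket becomes the lower limit in the WHERE clause, and the small remainder is the only offset the engine still has to discard.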
As already mentioned by Dan, you need to introduce a row number. Now row_number() over () exceeds resources. This basically means you have to split up the work of counting rows:
As partitions I used EXTRACT(month FROM pickup_datetime), as it distributes nicely.
Once you have saved it as a new table, you can use your new row column to query without losing performance:
Quite a hassle, but does the trick.
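The idea of splitting up the row counting can be illustrated outside BigQuery with a small sketch: number the rows inside each partition, then add the total size of all earlier partitions. This is not the original query; global_row_numbers is a hypothetical helper, and it assumes the partition key (here the month) sorts in the same order as the pickup time, which holds only while the data stays within one year.

```python
from collections import Counter
from itertools import accumulate


def global_row_numbers(pickups, partition_of):
    """Number rows 1..N in sorted order by numbering each partition separately
    and adding the total size of all earlier partitions."""
    sizes = Counter(partition_of(p) for p in pickups)
    ordered = sorted(sizes)
    # Offset of a partition = number of rows in all partitions before it.
    offsets = dict(zip(ordered, accumulate([0] + [sizes[k] for k in ordered[:-1]])))
    numbered = []
    for part in ordered:
        members = sorted(p for p in pickups if partition_of(p) == part)
        for local, pickup in enumerate(members, start=1):
            numbered.append((offsets[part] + local, pickup))
    return numbered


# Hypothetical pickup timestamps, partitioned by month ("MM").
pickups = [
    "2015-03-01 10:00",
    "2015-01-15 08:00",
    "2015-02-20 09:00",
    "2015-01-02 07:00",
    "2015-03-05 11:00",
]
numbered = global_row_numbers(pickups, lambda p: p[5:7])
```

Each partition is sorted on its own, so no single worker ever has to hold or order the whole table.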