Working with large offsets in BigQuery

2019-07-28 10:24发布

问题:

I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases until it hits ResourcesExceeded error. Here are a few example queries:

Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but was wondering if there are workarounds to achieve the above. If not, if someone could suggest an alternative approach to getting the above (such as kinetica or cassandra or whatever other approach), that would be greatly appreciated.

回答1:

As already mentioned by Dan you need to introduce a row number. Now row_number() over () exceeds resources. This basically means you have to split up the work of counting rows:

  • decide for few and as evenly distributed partitions as possible
  • count rows of each partition
  • cumulative sum of partitions to know later when to start where with counting rows
  • split up up work of counting rows
  • save new table with row count column for later use

As partitions I used EXTRACT(month FROM pickup_datetime) as it distributes nicely

WITH
  temp AS (
  SELECT
    *,
    -- cumulative sum of partition sizes so we know when to start counting rows here
    SUM(COALESCE(lagged,0)) OVER (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) cumulative
  FROM (
      -- lag partition sizes to next partition
    SELECT
      *,
      LAG(qty) OVER (ORDER BY month) lagged
    FROM (
        -- get partition sizes
      SELECT
        EXTRACT(month FROM pickup_datetime) month,
        COUNT(1) qty
      FROM
        `nyc-tlc.green.trips_2014`
      GROUP BY
        1)) )
SELECT
  -- cumulative sum = last row of former partition, add to new row count
  cumulative + ROW_NUMBER() OVER (PARTITION BY EXTRACT(month FROM pickup_datetime)) row,
  *
FROM
  `nyc-tlc.green.trips_2014`
-- import cumulative row counts
LEFT JOIN
  temp
ON
  (month= EXTRACT(month FROM pickup_datetime))

Once you saved it as a new table you can use your new row column to query without losing performance:

SELECT
  *
FROM
  `project.dataset.your_new_table`
WHERE
  row BETWEEN 10000001
  AND 10000100

Quite a hassle, but does the trick.



回答2:

Offset in systems like BigQuery work by reading and discarding all results until the offset.

You'll need to use a column as a lower limit to enable the engine to start directly from that part of the key range, you can't have the engine randomly seek midway through a query efficiently.

For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:

SELECT *
FROM [nyc-tlc:green.trips_2014]
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100

If you did this via OFFSET 100000, it takes 4s and the first row is:

pickup_datetime: 2014-01-06 04:11:34.000 UTC
dropoff_datetime: 2014-01-06 04:15:54.000 UTC
rate_code: 1

If instead of offset, I had used those date and rate values, the query only takes 2.9s:

SELECT *
FROM [nyc-tlc:green.trips_2014]
WHERE rate_code >= 1
  AND pickup_datetime >= "2014-01-06 04:11:34.000 UTC"
  AND dropoff_datetime >= "2014-01-06 04:15:54.000 UTC"
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
limit 100

So what does this mean? Rather than allowing the user to specific result # ranges (e.g, so new rows starting at 100000), have then specified it in a more natural form (e.g, so how rides that started on January 6th, 2015.

If you want to get fancy and REALLY need to allow the user to specific actual row numbers, you can make it a lot more efficient by calculating row ranges in advance, say query everything once and remember what row number is at the start of the hour for each day (8760 values), or even minutes (525600 values). You could then use this to better guess efficient start. Do a look-up for the closest day/minute for a given row range (e.g in Cloud Datastore), then convert that users query into the more efficient version above.



回答3:

Why not export the resulting table into GCS?

It will automatically split tables into files if you use wildcards, and this export only has to be done one time, instead of querying every single time and paying for all the processing.

Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.