Following up from my previous question:
Using "Cursors" for paging in PostgreSQL
What is a good way to provide an API client with 1,000,000 database results?
We are currently using PostgreSQL. A few suggested methods:
- Paging using Cursors
- Paging using random numbers (add "GREATER THAN ORDER BY" to each query)
- Paging using LIMIT and OFFSET (breaks down for very large data sets)
- Save information to a file and let the client download it
- Iterate through results, then POST the data to the client server
- Return only keys to the client, then let the client request the objects from cloud storage like Amazon S3 (this may still require paging just to get the file names).
What haven't I thought of that is stupidly simple and way better than any of these options?
The table has a primary key. Make use of it.
Instead of LIMIT and OFFSET, do your paging with a filter on the primary key. You hinted at this with your comment:
Paging using random numbers (add "GREATER THAN ORDER BY" to each query)
but there's nothing random about how you should do it.
SELECT * FROM big_table WHERE id > $1 ORDER BY id ASC LIMIT $2
Allow the client to specify both parameters: the last ID it saw and the number of records to fetch. Your API will need a placeholder value, an extra parameter, or an alternate call for "fetch the first n IDs" that omits the WHERE clause from the query, but that's trivial.
This approach will use a fairly efficient index scan to get the records in order, generally avoiding a sort or the need to iterate through all the skipped records. The client can decide how many rows it wants at once.
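As a sketch of the client-side loop, here is keyset pagination end to end. Python's stdlib sqlite3 driver and an in-memory table stand in for PostgreSQL (the table and column names are placeholders); the query itself works unchanged on Postgres:

```python
import sqlite3

# In-memory demo table standing in for big_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO big_table VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(1, 26)])

def fetch_page(last_id, page_size):
    """Keyset pagination: filter on the primary key instead of using OFFSET."""
    return conn.execute(
        "SELECT id, payload FROM big_table WHERE id > ? ORDER BY id ASC LIMIT ?",
        (last_id, page_size)).fetchall()

# Walk the whole table; the client passes back the last id it saw.
last_id, pages = 0, []
while True:
    page = fetch_page(last_id, 10)
    if not page:
        break
    pages.append(page)
    last_id = page[-1][0]   # highest id in this page

print([len(p) for p in pages])   # → [10, 10, 5]
```

Here `last_id = 0` doubles as the "fetch the first n IDs" placeholder, since all real keys are greater than it.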
This approach differs from the LIMIT and OFFSET approach in one key way: concurrent modification. If you INSERT into the table with a key lower than a key some client has already seen, this approach will not change its results at all, whereas the OFFSET approach will repeat a row. Similarly, if you DELETE a row with a lower-than-already-seen ID, the results of this approach will not change, whereas OFFSET will skip an unseen row. There is no difference for append-only tables with generated keys, though.
If you know in advance that the client will want the whole result set, the most efficient thing to do is just send them the whole result set with none of this paging business. That's where I would use a cursor. Read the rows from the DB and send them to the client as fast as the client will accept them. This API would need to set limits on how slow the client was allowed to be to avoid excessive backend load; for a slow client I'd probably switch to paging (as described above) or spool the whole cursor result out to a temporary file and close the DB connection.
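On PostgreSQL you would typically do this with a server-side cursor (e.g. a named cursor in psycopg2) so the driver does not buffer the whole result set. The shape of the streaming loop itself is just "one query, fetch in batches, push until done"; a generic sketch with sqlite3 standing in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO big_table VALUES (?)",
                 [(i,) for i in range(1000)])

def stream_rows(batch_size=100):
    """Run the query once, then read from the same cursor in batches,
    yielding rows as fast as the consumer accepts them. No OFFSET and
    no per-page re-planning of the query."""
    cur = conn.execute("SELECT id FROM big_table ORDER BY id")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            return
        yield from batch

total = sum(1 for _ in stream_rows())
print(total)   # → 1000
```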
Important caveats:
- Requires a UNIQUE constraint / UNIQUE index or PRIMARY KEY to be reliable
- Different concurrent modification behaviour from LIMIT/OFFSET; see above
Have the API accept an offset to start from and the number of records to return. This is a form of paging where the client can decide how many records to fetch in one page request. The API should also return the total number of records matching the query, so the client knows how many "pages" there are; alternatively, the client can infer that it has reached the last records when the number of records returned is zero or less than the number requested. You can control this in your PostgreSQL query with the OFFSET clause (which record to start retrieving at) and the LIMIT clause (number of records to return) in your SELECT statement.
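A minimal sketch of that contract (sqlite3 again standing in for PostgreSQL; `get_page` and `big_table` are illustrative names): each call returns one page plus the total, and a short final page signals the end:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO big_table VALUES (?)",
                 [(i,) for i in range(1, 43)])   # 42 rows

def get_page(offset, limit):
    """LIMIT/OFFSET paging; also return the total row count so the
    client can compute the number of pages."""
    total = conn.execute("SELECT COUNT(*) FROM big_table").fetchone()[0]
    rows = conn.execute(
        "SELECT id FROM big_table ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset)).fetchall()
    return rows, total

rows, total = get_page(40, 10)
print(len(rows), total)   # → 2 42  (short page: the client is done)
```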