I'm creating a RESTful web service (in Golang) which pulls a set of rows from the database and returns it to a client (smartphone app or web application). The service needs to be able to provide paging. The only problem is this data is sorted on a regularly changing "computed" column (for example, the number of "thumbs up" or "thumbs down" a piece of content on a website has), so rows can jump around page numbers in between a client's request.
I've looked at a few PostgreSQL features that I could potentially use to help me solve this problem, but nothing really seems to be a very good solution.
- Materialized Views: to hold "stale" data which is only updated every once in a while. This doesn't really solve the problem, as the data would still jump around if the user happens to be paging through the data when the Materialized View is updated.
- Cursors: created for each client session and held between requests. This seems like it would be a nightmare if there are a lot of concurrent sessions at once (which there will be).
Does anybody have any suggestions on how to handle this, either on the client side or database side? Is there anything I can really do, or is an issue such as this normally just remedied by the clients consuming the data?
Edit: I should mention that the smartphone app is allowing users to view more pieces of data through "infinite scrolling", so it keeps track of it's own list of data client-side.
This is a problem without a perfectly satisfactory solution because you're trying to combine essentially incompatible requirements:
Send only the required amount of data to the client on-demand, i.e. you can't download the whole dataset then paginate it client-side.
Minimise amount of per-client state that the server must keep track of, for scalability with large numbers of clients.
Maintain different state for each client
This is a "pick any two" kind of situation. You have to compromise; accept that you can't keep each client's pagination state exactly right, accept that you have to download a big data set to the client, or accept that you have to use a huge amount of server resources to maintain client state.
There are variations within those that mix the various compromises, but that's what it all boils down to.
For example, some people will send the client some extra data, enough to satisfy most client requirements. If the client exceeds that, then it gets broken pagination.
Some systems will cache client state for a short period (with short lived unlogged tables, tempfiles, or whatever), but expire it quickly, so if the client isn't constantly asking for fresh data its gets broken pagination.
Etc.
See also:
- How to provide an API client with 1,000,000 database results?
- Using "Cursors" for paging in PostgreSQL
- Iterate over large external postgres db, manipulate rows, write output to rails postgres db
- offset/limit performance optimization
- If PostgreSQL count(*) is always slow how to paginate complex queries?
- How to return sample row from database one by one
I'd probably implement a hybrid solution of some form, like:
Using a cursor, read and immediately send the first part of the data to the client.
Immediately fetch enough extra data from the cursor to satisfy 99% of clients' requirements. Store it to a fast, unsafe cache like memcached, Redis, BigMemory, EHCache, whatever under a key that'll let me retrieve it for later requests by the same client. Then close the cursor to free the DB resources.
Expire the cache on a least-recently-used basis, so if the client doesn't keep reading fast enough they have to go get a fresh set of data from the DB, and the pagination changes.
If the client wants more results than the vast majority of its peers, pagination will change at some point as you switch to reading direct from the DB rather than the cache or generate a new bigger cached dataset.
That way most clients won't notice pagination issues and you don't have to send vast amounts of data to most clients, but you won't melt your DB server. However, you need a big boofy cache to get away with this. Its practical depends on whether your clients can cope with pagination breaking - if it's simply not acceptable to break pagination, then you're stuck with doing it DB-side with cursors, temp tables, coping the whole result set at first request, etc. It also depends on the data set size and how much data each client usually requires.
I am not aware of a perfect solution for this problem. But if you want the user to have a stale view of the data then cursor is the way to go. Only tuning you can do is to store only the data for 1st 2 pages in the cursor. Beyond that you fetch it again.