Possible Duplicate:
How to provide an API client with 1,000,000 database results?
Wondering of the use of Cursors is a good way to implement "paging" using PostgreSQL.
The use case is that we have upwards 100,000 rows that we'd like to make available to our API clients. We thought a good way to make this happen would be to allow the client to request the information in batches ( pages ). The client could request 100 rows at a time. We would return the 100 rows as well as a cursor, then when the client was ready, they could request the next 100 rows using the cursor that we sent to them.
However, I'm a little hazy on how cursors work and exactly how and when cursors should be used:
- Do the cursors require that a database connection be left open?
- Do the cursors run inside a transaction, locking resources until they are "closed"?
- Are there any other "gotchas" that I'm not aware of?
- Is there another, better way that this situation should be handled?
Thanks so much!
Cursors are a reasonable choice for paging in smaller intranet applications that work with large data sets, but you need to be prepared to discard them after a timeout. Users like to wander off, go to lunch, go on holiday for two weeks, etc, and leave their applications running. If it's a web-based app there's even the question of what "running" is and how to tell if the user is still around.
They are not suitable for large-scale applications with high client counts and clients that come and go near-randomly like in web-based apps or web APIs. I would not recommend using cursors in your application unless you have a fairly small client count and very high request rates ... in which case sending tiny batches of rows will be very inefficient and you should think about allowing range-requests etc instead.
Cursors have several costs. If the cursor is not WITH HOLD
you must keep a transaction open. The open transaction can prevent autovacuum from doing its work properly, causing table bloat and other issues. If the cursor is declared WITH HOLD
and the transaction isn't held open you have to pay the cost of materializing and storing a potentially large result set - at least, I think that's how hold cursors work. The alternative is just as bad, keeping the transaction implicitly open until the cursor is destroyed and preventing rows from being cleaned up.
Additionally, if you're using cursors you can't hand connections back to a connection pool. You'll need one connection per client. That means more backend resources are used just maintaining session state, and sets a very real upper limit on the number of clients you can handle with a cursor-based approach.
There's also the complexity and overhead of managing a stateful, cursor-based setup as compared to a stateless connection-pooling approach with limit and offset. You need to have your application expire cursors after a timeout or you face potentially unbounded resource use on the server, and you need to keep track of which connections have which cursors for which result sets for which users....
In general, despite the fact that it can be quite inefficient, LIMIT
and OFFSET
can be the better solution. It can often be better to search the primary key rather than using OFFSET
, though.
By the way, you were looking at the documentation for cursors in PL/pgSQL. You want normal SQL-level cursors for this job.
Do the cursors require that a database connection be left open?
Yes.
Do the cursors run inside a transaction, locking resources until they
are "closed"?
Yes unless they are WITH HOLD
, in which case they consume other database resources.
Are there any other "gotchas" that I'm not aware of?
Yes, as the above should explain.
For HTTP clients, don't use cursors to implement paging. For scalability, you don't want server resources tied up between requests.
Instead, use LIMIT and OFFSET on your queries; see LIMIT
and OFFSET
in the Pg docs.
But make sure that the indexing on your tables will support efficient queries of this form.
Design a RESTful API, so that the client can invoke the "next_url" (also passed in the response) to get the next set of rows.