I am currently working on a REST API. The resource the API returns is expected to be a huge data set from the database (tens of millions of rows). Pagination is a must in order to avoid enormous memory consumption when writing the data to the HTTP response.
How do I ensure data integrity when rows are deleted or added in the database between client requests?
For example:
page 1: [ John, Mary, Harry, David, Joe ]
page 2: [ Mike, Don, Alex ]
After the client has requested page 1 and stored it locally (in a file or in memory), but before it asks for page 2, the data changes to:
page 1: [ John, Mary, Harry, David, **Mike** ]
page 2: [ Don, Alex, **Terry** ]
A truly RESTful (and therefore server-side stateless) answer would be:
- ask for the first five records (the last one is "Joe"),
- then ask for the five records that come after[1] "Joe",
- and so on.
With this strategy you'll get "Mike" and "Terry" on page #2, as in the sketch below.
[1] The records must have a sort order (alphabetical or otherwise).
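Here's a minimal sketch of that strategy in Python, assuming an illustrative `users` table with a unique `name` column sorted alphabetically (the table and function names are made up for the example):

```python
import sqlite3

PAGE_SIZE = 5

def fetch_page_after(conn, last_name=None, page_size=PAGE_SIZE):
    """Keyset ("seek") pagination: return the next page of names that
    sort strictly after last_name; None means start from the beginning."""
    if last_name is None:
        cur = conn.execute(
            "SELECT name FROM users ORDER BY name LIMIT ?", (page_size,))
    else:
        cur = conn.execute(
            "SELECT name FROM users WHERE name > ? ORDER BY name LIMIT ?",
            (last_name, page_size))
    return [name for (name,) in cur]

# Usage: the client sends back the last name it saw instead of a page number.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("Alex",), ("David",), ("Don",), ("Harry",), ("Joe",),
                  ("John",), ("Mary",), ("Mike",)])
page1 = fetch_page_after(conn)                       # first five names
page2 = fetch_page_after(conn, last_name=page1[-1])  # names after the last one
```

Because the cursor is the last value the client actually received, a row inserted or deleted earlier in the ordering never makes a record repeat or disappear on the next page.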
One solution to this is to return a "temporary" resource representing the query result set, and then allow the client to paginate through that using GETs.
For example:
GET /big-query/all-users
Returns: /query-results/12345
GET /query-results/12345?page=1
Returns: users 1-20
GET /query-results/12345?page=2
Returns: users 21-40
The obvious issue with this solution is that changes to the actual users won't be reflected in the query result set, so you should make that clear in your API docs. Also, it would be good to "expire" the result set after a reasonable amount of time to (a) prevent it from going stale and (b) allow your server to reclaim the memory it is holding hostage.
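A bare-bones, in-memory sketch of that idea is below; the snapshot store, the 15-minute TTL, and the helper names are all illustrative, and a real service would probably keep the snapshot somewhere more durable than a module-level dict:

```python
import time
import uuid

PAGE_SIZE = 20
RESULT_TTL_SECONDS = 15 * 60  # expire snapshots after 15 minutes

# result-set id -> (creation time, materialized rows)
_snapshots = {}

def create_query_result(all_rows):
    """Backs GET /big-query/all-users: snapshot the result set and
    return an id the client can page through."""
    result_id = uuid.uuid4().hex
    _snapshots[result_id] = (time.time(), list(all_rows))
    return result_id

def get_page(result_id, page):
    """Backs GET /query-results/<id>?page=<n>: serve one page from the
    snapshot, or report that it has expired."""
    entry = _snapshots.get(result_id)
    if entry is None or time.time() - entry[0] > RESULT_TTL_SECONDS:
        _snapshots.pop(result_id, None)  # reap the expired snapshot
        raise KeyError("result set expired or unknown")
    _, rows = entry
    start = (page - 1) * PAGE_SIZE
    return rows[start:start + PAGE_SIZE]
```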
The other approach is to re-issue the query on each request and then paginate into the result set to find the right chunk of data to return. That is stateless and, unlike the earlier idea, requires no eviction strategy, but it does mean the query is re-run on every request. The upside is that the results are as fresh as possible on each page.
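For comparison, a sketch of that re-query approach against the same illustrative `users` table, using plain LIMIT/OFFSET; this is exactly the variant where a row added or deleted between requests can shift items across page boundaries, as in the question's example:

```python
PAGE_SIZE = 20

def fetch_page_by_offset(conn, page, page_size=PAGE_SIZE):
    """Offset pagination: re-run the query on every request and slice
    out page `page` (1-based) with LIMIT/OFFSET."""
    offset = (page - 1) * page_size
    cur = conn.execute(
        "SELECT name FROM users ORDER BY name LIMIT ? OFFSET ?",
        (page_size, offset))
    return [name for (name,) in cur]
```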