I'm interested in exposing a direct REST interface to collections of JSON documents (think CouchDB or Persevere). The problem I'm running into is how to handle the GET
operation on the collection root if the collection is large.
As an example pretend I'm exposing StackOverflow's Questions
table where each row is exposed as a document (not that there necessarily is such a table, just a concrete example of a sizable collection of 'documents'). The collection would be made available at /db/questions
with the usual CRUD api GET /db/questions/XXX
, PUT /db/questions/XXX
, POST /db/questions
is in play. The standard way to get the entire collection is to GET /db/questions
but if that naively dumps each row as a JSON object, you'll get a rather sizeable download and a lot of work on the part of the server.
The solution is, of course, paging. Dojo has solved this problem in its JsonRestStore via a clever RFC2616-compliant extension of using the Range
header with a custom range unit items
. The result is a 206 Partial Content
that returns only the requested range. The advantage of this approach over a query parameter is that it leaves the query string for...queries (e.g. GET /db/questions/?score>200
or somesuch, and yes that'd be encoded %3E
).
This approach completely covers the behavior I want. The problem is that RFC 2616 specifies that on a 206 response (emphasis mine):
The request MUST have included a Range header field (section 14.35) indicating the desired range, and MAY have included an If-Range header field (section 14.27) to make the request conditional.
This makes sense in the context of the standard usage of the header but is a problem because I'd like the 206 response to be the default to handle naive clients/random people exploring.
I've gone over the RFC in detail looking for a solution but have been unhappy with my solutions and am interested in SO's take on the problem.
Ideas I've had:
- Return
200
with aContent-Range
header! - I don't think that this is wrong, but I'd prefer if a more obvious indicator that the response is only Partial Content. - Return
400 Range Required
- There is not a special 400 response code for required headers, so the default error has to be used and read by hand. This also makes exploration via web browser (or some other client like Resty) more difficult. - Use a query parameter - The standard approach, but I'm hoping to allow queries a la Persevere and this cuts into the query namespace.
- Just return
206
! - I think most clients wouldn't freak out, but I'd rather not go against a MUST in the RFC - Extend the spec! Return
266 Partial Content
- Behaves exactly like 206 but is in response to a request that MUST NOT contain theRange
header. I figure that 266 is high enough that I shouldn't run into collision issues and it makes sense to me but I'm not clear on whether this is considered taboo or not.
I'd think this is a fairly common problem and I'd like to see this done in a sort of de facto fashion so I or someone else isn't reinventing the wheel.
What's the best way to expose a full collection via HTTP when the collection is large?
You can detect the
Range
header, and mimic Dojo if it is present, and mimic Atom if it is not. It seems to me that this neatly divides the use cases. If you are responding to a REST query from your application, you expect it to be formatted with aRange
header. If you are responding to a casual browser, then if you return paging links it will let the tool provide an easy way to explore the collection.I think the real problem here is that there is nothing in the spec that tells us how to do automatic redirects when faced with 413 - Requested Entity Too Large.
I was struggling with this same problem recently and I looked for inspiration in the RESTful Web Services book. Personally I don't think 206 is appropriate due to the header requirement. My thoughts also led me to 300, but I thought that was more for different mime-types, so I looked up what Richardson and Ruby had to say on the subject in Appendix B, page 377. They suggest that the server just pick the preferred representation and send it back with a 200, basically ignoring the notion that it should be a 300.
That also jibes with the notion of links to next resources that we have from atom. The solution I implemented was to add "next" and "previous" keys to the json map I was sending back and be done with it.
Later on I started thinking maybe the thing to do is send a 307 - Temporary Redirect to a link that would be something like /db/questions/1,25 - that leaves the original URI as the canonical resource name, but it gets you to an appropriately named subordinate resource. This is behavior I'd like to see out of a 413, but 307 seems a good compromise. Haven't actually tried this in code yet though. What would be even better is for the redirect to redirect to a URL containing the actual IDs of the most recently asked questions. For example if each question has an integer ID, and there are 100 questions in the system and you want to show the ten most recent, requests to /db/questions should be 307'd to /db/questions/100,91
This is a very good question, thanks for asking it. You confirmed for me that I'm not nuts for having spent days thinking about it.
One of the big problems with range headers is that a lot of corporate proxies filter them out. I'd advise to use a query parameter instead.
You can still return
Accept-Ranges
andContent-Ranges
with a200
response code. These two response headers give you enough information to infer the same information that a206
response code provides explicitly.I would use
Range
for pagination, and have it simply return a200
for a plainGET
.This feels 100% RESTful and doesn't make browsing any more difficult.
Edit: I wrote a blog post about this: http://otac0n.com/blog/2012/11/21/range-header-i-choose-you.html
I don't really agree with some of you guys. I've been working for weeks on this features for my REST service. What I ended up doing is really simple. My solution only makes a sense for what REST people call a collection.
Client MUST include a "Range" header to indicate which part of the collection he needs, or otherwise be ready to handle a 413 REQUESTED ENTITY TOO LARGE error when the requested collection is too large to be retrieved in a single round-trip.
Server sends a 206 PARTIAL CONTENT response, with the Content-Range header specifying which part of the resource has been sent, and an ETag header to identify the current version of the collection. I usually use a Facebook-like ETag {last_modification_timestamp}-{resource_id}, and I consider that the ETag of a collection is that of the most recently modified resource it contains.
To request a specific part of a collection, the client MUST use the "Range" header, and fill the "If-Match" header with the ETag of the collection obtained from previously performed requests to acquire other parts of the same collection. The server can therefore verify that the collection hasn't changed before sending the requested portion. If a more recent version exists, a 412 PRECONDITION FAILED response is returned to invite the client to retrieve the collection from scratch. This is necessary because it could mean that some resources might have been added or removed before or after the currently requested part.
I use ETag/If-Match in tandem with Last-Modified/If-Unmodified-Since to optimize cache. Browsers and proxies might rely on one or both of them for their caching algorithms.
I think that a URL should be clean unless it's to include a search/filter query. If you think about it, a search is nothing more than a partial view of a collection. Instead of the cars/search?q=BMW type of URLs, we should see more cars?manufacturer=BMW.
Seems to me that the best way to do this is to include range as query parameters. e.g., GET /db/questions/?date>mindate&date<maxdate. Upon a GET to the /db/questions/ with no query parameters, return 303 with Location: /db/questions/?query-parameters-to-retrieve-the-default-page. Then provide a different URL by which whomever is consuming your API to get statistics about the collection (e.g., what query parameters to use if s/he wants the entire collection);