Batch!
So GData currently implements a batch update model that is based on a modified Atom feed. Each entry in the feed represents a CRUD operation to be carried out against a collection. Entries can be created, updated, deleted or queried. I’ve been watching the effort for a while now and have been planning to write up some thoughts on the subject but just never seemed to be able to get around to it.
First of all, let’s take a look at an example of a GData batch update:
<?xml version="1.0" encoding="UTF-8"?>
<feed
xmlns="http://www.w3.org/2005/Atom"
xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
xmlns:g="http://base.google.com/ns/1.0"
xmlns:batch="http://schemas.google.com/gdata/batch">
<title type="text">My Batch Feed</title>
<entry>
<id>http://www.google.com/base/feeds/items/13308004346459454600</id>
<batch:operation type="delete"/>
</entry>
<entry>
<id>http://www.google.com/base/feeds/items/17437536661927313949</id>
<batch:operation type="delete"/>
</entry>
<entry>
<title type="text">...</title>
<content type="html">...</content>
<batch:id>itemA</batch:id>
<batch:operation type="insert"/>
<g:item_type>recipes</g:item_type>
</entry>
<entry>
<title type="text">...</title>
<content type="html">...</content>
<batch:id>itemB</batch:id>
<batch:operation type="insert"/>
<g:item_type>recipes</g:item_type>
</entry>
</feed>
If the mere sight of this doesn’t give you shivers and shakes, let me give you a few reasons why it should:
- It’s not valid Atom. Note the first entry in the feed for instance. An Atom entry has an id, a title, some content, an author, some links, maybe some categories, etc. If the type of objects you want to represent does not also have those things, Atom is not the right format to use.
- It only works with Atom. What about binary resources like JPEGs? I guess we could base64 encode the binary data and stuff that into our invalid Atom entries but doing so would suck.
- We can’t use ETags and conditional requests, because the entries in the batch feed have no place for per-operation HTTP headers such as If-Match.
- I’m sure there are more reasons but these should be enough to convince you that a better approach is needed.
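To make the first point concrete, here is a rough sketch that checks a batch entry against the three child elements RFC 4287 requires of every Atom entry (id, title and updated). The helper name `missing_required` is made up for illustration, and a real validator would check much more (authors, exactly-one constraints, and so on).

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
# Elements every atom:entry must contain according to RFC 4287.
REQUIRED = ("id", "title", "updated")


def missing_required(entry):
    """Return the names of required Atom children that the entry lacks."""
    return [name for name in REQUIRED if entry.find(ATOM + name) is None]


# The first entry of the GData batch feed shown above, trimmed down.
feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom"'
    ' xmlns:batch="http://schemas.google.com/gdata/batch">'
    "<entry>"
    "<id>http://www.google.com/base/feeds/items/13308004346459454600</id>"
    '<batch:operation type="delete"/>'
    "</entry>"
    "</feed>"
)

for entry in feed.findall(ATOM + "entry"):
    print(missing_required(entry))  # prints ['title', 'updated']
```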
Let’s set a few ground rules. First, assuming that we need batch updates at all, we need a format that can work with more than just Atom entries. Second, we need a format that does not require us to bastardize Atom or use it for purposes for which it was never intended and is not well suited. Third, we need an approach that does not duplicate the disadvantages of the WS-* model by circumventing key elements of HTTP and REST.
Consider the following:
PATCH /my/atompub/collection HTTP/1.1
Host: example.org
Content-Type: multipart/mixed; boundary=batch
--batch
Content-Id: <batch-1>
Batch-Operation: POST /my/atompub/collection
Host: example.org
Content-Type: application/atom+xml;type=entry
<?xml version="1.0"?>
<entry xmlns="…">…</entry>
--batch
Content-Id: <batch-2>
Batch-Operation: DELETE /my/atompub/collection/entries/2
Host: example.org
If-Match: "ABC123XYZ"
--batch--
The PATCH operation tells the server that the request entity contains a set of instructions for how the server resources are to be modified. The Content-Type value “multipart/mixed” tells the server that the ordering of the parts is significant. Each part represents a single batched HTTP request, complete with headers. The Batch-Operation header carries the HTTP request method and URI. The Content-Id header provides the identifier for each batched request. Notice that each batched request can target a distinct resource.
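As a rough illustration of how a client might assemble such a request body, here is a minimal sketch in Python. The `BatchedRequest` type and `build_batch_body` function are hypothetical names invented for this example, not part of any library, and the sketch only handles text bodies (binary payloads would need bytes handling).

```python
from dataclasses import dataclass, field

CRLF = "\r\n"


@dataclass
class BatchedRequest:
    """One batched HTTP request (hypothetical helper type for this sketch)."""
    content_id: str
    method: str
    uri: str
    headers: dict = field(default_factory=dict)
    body: str = ""


def build_batch_body(requests, boundary="batch"):
    """Serialize the requests as a multipart/mixed body, one part per request."""
    chunks = []
    for req in requests:
        lines = [
            f"--{boundary}",
            f"Content-Id: <{req.content_id}>",
            f"Batch-Operation: {req.method} {req.uri}",
        ]
        lines += [f"{name}: {value}" for name, value in req.headers.items()]
        lines.append("")        # blank line separates part headers from the body
        lines.append(req.body)  # may be empty, e.g. for DELETE
        chunks.append(CRLF.join(lines))
    chunks.append(f"--{boundary}--")  # closing delimiter
    return CRLF.join(chunks) + CRLF


batch = build_batch_body([
    BatchedRequest(
        "batch-1", "POST", "/my/atompub/collection",
        {"Host": "example.org",
         "Content-Type": "application/atom+xml;type=entry"},
        '<?xml version="1.0"?><entry xmlns="http://www.w3.org/2005/Atom"/>',
    ),
    BatchedRequest(
        "batch-2", "DELETE", "/my/atompub/collection/entries/2",
        {"Host": "example.org", "If-Match": '"ABC123XYZ"'},
    ),
])
print(batch)
```

The resulting string would be sent as the entity of the PATCH request, with `Content-Type: multipart/mixed; boundary=batch` on the outer request.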
Now consider the response:
HTTP/1.1 200 OK
Content-Type: multipart/mixed; boundary=batch-response
--batch-response
Content-Id: <batch-1>
Content-Type: application/atom+xml;type=entry
Batch-Operation: POST /my/atompub/collection
Host: example.org
Batch-Status: 201 Created
Location: /my/atompub/collection/entries/1
ETag: "ABCDEFGH"
<?xml version="1.0"?>
<entry xmlns="...">...</entry>
--batch-response
Content-Id: <batch-2>
Batch-Operation: DELETE /my/atompub/collection/entries/2
Host: example.org
Batch-Status: 412 Precondition Failed
--batch-response--
Each part represents the response to one batched request. The Content-Id correlates with the Content-Id used in the request. The Batch-Operation is echoed back for further correlation. The Batch-Status provides the result of the individual operation. Other response headers, such as ETag and Location, can be included in each part.
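On the client side, correlating results with requests could look roughly like the sketch below. It leans on the MIME parser in Python's standard `email` package and assumes the parts are close enough to plain MIME for that parser to cope, which may not hold for every HTTP header. The function names are made up for illustration.

```python
from email import message_from_string


def parse_batch_response(content_type, body):
    """Map each part's Content-Id to a (status_code, reason, part) tuple."""
    # Prepend the outer Content-Type so the parser knows the boundary.
    msg = message_from_string(f"Content-Type: {content_type}\r\n\r\n{body}")
    if not msg.is_multipart():
        raise ValueError("expected a multipart response")
    results = {}
    for part in msg.get_payload():
        content_id = part["Content-Id"].strip("<>")
        code, _, reason = part["Batch-Status"].partition(" ")
        results[content_id] = (int(code), reason, part)
    return results


def failed_ids(results):
    """Content-Ids of batched requests that did not succeed (status >= 400)."""
    return [cid for cid, (code, _, _) in results.items() if code >= 400]


response_body = (
    "--batch-response\r\n"
    "Content-Id: <batch-1>\r\n"
    "Content-Type: application/atom+xml;type=entry\r\n"
    "Batch-Status: 201 Created\r\n"
    'ETag: "ABCDEFGH"\r\n'
    "\r\n"
    '<entry xmlns="http://www.w3.org/2005/Atom"/>\r\n'
    "--batch-response\r\n"
    "Content-Id: <batch-2>\r\n"
    "Batch-Status: 412 Precondition Failed\r\n"
    "\r\n"
    "\r\n"
    "--batch-response--\r\n"
)

results = parse_batch_response(
    "multipart/mixed; boundary=batch-response", response_body)
print(failed_ids(results))  # prints ['batch-2']
```

A client could then retry or otherwise compensate for only the Content-Ids reported as failed.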
Note that this approach works well with non-Atom resources as well. For instance, if I want to post multiple images to an Atompub media collection, I could do:
PATCH /my/atompub/collection HTTP/1.1
Host: example.org
Content-Type: multipart/mixed; boundary=batch
--batch
Content-Id: <batch-1>
Batch-Operation: POST /my/atompub/collection
Host: example.org
Content-Type: image/jpeg
Slug: Trip to Hawaii
{image data}
--batch
Content-Id: <batch-2>
Batch-Operation: POST /my/atompub/collection
Host: example.org
Content-Type: image/jpeg
Slug: Trip to Alaska
{image data}
--batch--
This approach lets us take advantage of lost-update detection using ETags, supports more than just Atom entries, and does not require us to compromise the validity of Atom. It is also very similar to pipelined HTTP requests.
In any case, I’ve got an implementation of this almost complete. Feedback is definitely welcome, especially feedback that points out any fundamental flaws.
October 28th, 2007 at 2:34 am
Hi.
I agree with the points you made about GData.
The WebDAV WG has seen requests for batch methods many times. There’s always some discussion about the goals that you want to achieve, namely:
- doing multiple things at once for performance reasons (see CalDAV’s multiget report - http://greenbytes.de/tech/webdav/rfc4791.html#calendar-multiget)
- doing multiple things atomically
Common issues thus are:
- effect on bypassing intermediaries (could HTTP pipelining work?),
- sending back information about partial success (see WebDAV’s multistatus response format).
So is this meant to be atomic?
October 28th, 2007 at 3:43 am
Just as a pragmatic matter, what are you using on the backend to implement the PATCH method? Existing Java servlet containers will vomit on random methods (unless you reimplement service(req, resp))… Have you addressed this elsewhere?
October 28th, 2007 at 8:08 am
Jonathan: Overriding service is the only way to make it work.
October 28th, 2007 at 9:02 am
Julian, the most important question that needs to be resolved is whether this kind of batching is even needed in the first place.
Regarding HTTP pipelining, it is effective, but it specifically rules out pipelining non-idempotent methods, whereas the main purpose of GData’s mechanism is to allow batching of unsafe, non-idempotent operations.
Regarding multistatus responses, the main challenge I have with webdav’s approach is that while it works, I don’t understand why the http header/entity model needs to be refactored as XML. The mime examples I give here demonstrate that such refactoring is not required. Compared side-by-side, the mime examples above are arguably simpler than the webdav multistatus equivalent.
Regarding whether or not the operation should be atomic, I’m not convinced we should go that far. The response needs to tell us what batched requests succeeded and which failed; the client can compensate accordingly.
Something else to keep in mind is that, the way the MIME batch is put together, it would be entirely possible and fairly simple for an intermediary proxy receiving a batch request to split it up into multiple, separate requests to one or more origin servers. Caching, pipelining, etc. could still be used. Each batched request could even specify its own Authorization, negotiation, etc.
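A rough sketch of that splitting step, again using Python's standard `email` parser and assuming the parts parse cleanly as MIME. `split_batch` is a hypothetical name, and actually forwarding each request to an origin server is left out.

```python
from email import message_from_string

# Headers that describe the batch envelope rather than the request itself.
ENVELOPE_HEADERS = ("content-id", "batch-operation")


def split_batch(content_type, body):
    """Turn each part into a standalone (method, uri, headers, body) tuple."""
    msg = message_from_string(f"Content-Type: {content_type}\r\n\r\n{body}")
    if not msg.is_multipart():
        raise ValueError("expected a multipart batch")
    requests = []
    for part in msg.get_payload():
        method, _, uri = part["Batch-Operation"].partition(" ")
        headers = {name: value for name, value in part.items()
                   if name.lower() not in ENVELOPE_HEADERS}
        requests.append((method, uri, headers, part.get_payload()))
    return requests


sample = (
    "--batch\r\n"
    "Content-Id: <batch-1>\r\n"
    "Batch-Operation: POST /my/atompub/collection\r\n"
    "Host: example.org\r\n"
    "Content-Type: application/atom+xml;type=entry\r\n"
    "\r\n"
    '<entry xmlns="http://www.w3.org/2005/Atom"/>\r\n'
    "--batch\r\n"
    "Content-Id: <batch-2>\r\n"
    "Batch-Operation: DELETE /my/atompub/collection/entries/2\r\n"
    "Host: example.org\r\n"
    'If-Match: "ABC123XYZ"\r\n'
    "\r\n"
    "\r\n"
    "--batch--\r\n"
)

for method, uri, headers, _ in split_batch(
        "multipart/mixed; boundary=batch", sample):
    print(method, uri, headers.get("Host"))
```

Because each tuple carries its own headers, an intermediary could send the requests on independently, each with its own Host, If-Match or Authorization.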
October 29th, 2007 at 3:50 am
James,
> Julian, the most important question that needs to be resolved is whether this kind of batching is even needed in the first place.
Agreed.
> Regarding multistatus responses, the main challenge I have with webdav’s approach is that while it works, I don’t understand why the http header/entity model needs to be refactored as XML. The mime examples I give here demonstrate that such refactoring is not required. Compared side-by-side, the mime examples above are arguably simpler than the webdav multistatus equivalent.
Sorry for the misunderstanding — I just was pointing to that format as example for prior work.
WRT parsing the response: with your proposal (at least in Java) you will need a custom text-based parser for the responses unless I’m missing some way to re-use servlet container functionality. This is definitely not trivial.
> Regarding whether or not the operation should be atomic, I’m not convinced we should go that far. The response needs to tell us what batched requests succeeded and which failed; the client can compensate accordingly.
OK, in which case this is only about performance. In which case I’m even more sceptical it’s needed.
Best regards, Julian
October 29th, 2007 at 6:57 am
A custom text parser is not required. A simple mime parser works (e.g. mime4j) and is generally less complicated than dealing with xml. The main issue on this point, however, comes down to why would xml be needed at all?
> OK, in which case this is only about performance. In which case I’m even more sceptical it’s needed.
Ditto. I look at this as more of an exercise. There are folks who are convinced that batching is required. I’m not one of those people. If it turns out that I’m wrong, however, what would the proper solution look like?
October 30th, 2007 at 1:32 am
> A custom text parser is not required. A simple mime parser works (e.g. mime4j) and is generally less complicated than dealing with xml. The main issue on this point, however, comes down to why would xml be needed at all?
Well, one answer would be: why not? Everybody already has an XML parser around. For instance, it’s part of the JDK, while a standalone MIME parser is not, right?
Also keep in mind that although HTTP messages resemble MIME messages, there are some subtle differences (http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.19.4), so reusing a MIME parser may not work as painlessly as you think.
Best regards, Julian