PaceReturnCollNewestToOldest


Abstract

A technique is given for supporting synchronization of Atom collections between client and server. The technique is intended to be easy to implement and robust in the face of simultaneous updates.

This Pace specifies a way for a client to find out, through a series of requests, which items have changed since time X. Even if the collection is independently updated during the series of requests, no duplicates will appear in the transmission, and omitted items are minimized.

Status

Draft

Problem

The Atom protocol allows for the retrieval of large data collections in pieces, yet clients and servers may have different limits on how many items can be returned in a single request. What is needed is a defined expectation of which items a request will return when it cannot return all of them. With that, a series of requests can efficiently fetch all of the items that a client needs to see, eliminating or minimizing duplicated and omitted items.

This Pace specifies behavior that supports efficient syncing between client and server. See also SyncIsTheParadigm.

Normative Text

(to go in a future section regarding Atom collections)

X.Y.Z1 Modified-Range Collection Request

An Atom server may offer any of several collections, each of which is retrievable at some URL. A GET request to a collection MAY include a Modified-Range HTTP header. Either the range start or the range end MAY be omitted by the client, in which case it is taken to be -inf or inf, respectively. If the header is omitted, the requested range is taken as (-inf, inf).
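
The header's exact syntax is not defined by the text above. As a rough sketch, the following Python assumes the slash-separated ISO 8601 form used in the example later in this Pace. The function names and the sentinel values standing in for -inf and inf are illustrative only.

```python
from datetime import datetime, timezone

# Assumed sentinels for -inf and inf. datetime.min/max are naive, so attach UTC.
NEG_INF = datetime.min.replace(tzinfo=timezone.utc)
POS_INF = datetime.max.replace(tzinfo=timezone.utc)


def parse_timestamp(text):
    """Parse a UTC timestamp of the form 2004-09-01T16:15:00Z."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_modified_range(header_value):
    """Return (start, end) for a Modified-Range value.

    Pass None when the request carried no header at all. A missing start
    or end is replaced by the matching sentinel.
    """
    if header_value is None:
        return NEG_INF, POS_INF
    start_text, _, end_text = header_value.strip().partition("/")
    start = parse_timestamp(start_text) if start_text else NEG_INF
    end = parse_timestamp(end_text) if end_text else POS_INF
    return start, end
```

Under these assumptions, a value such as "/2004-10-31T19:29:00Z" would be read as (-inf, 2004-10-31T19:29:00Z).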

X.Y.Z2 Collection Subset Response

In response to a collection request, a server MAY return a subset of the items within the requested range. The items returned MUST constitute the most recently modified items within the range. That is, there MUST be no other item in the same collection that was last modified within the given range and that was modified more recently than the oldest item in the result set.

If the server returns only a subset of the items falling within the requested range, it MUST return an HTTP response code of 2xx. [@@TBD@@ Determine response code]

Each collection item MUST include a <modified> element giving the last time at which a change was made to the resource specified by the item, in ISO 8601 format.
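
One way a server might satisfy the subset rule is sketched below in Python. The function name, the dictionary-based item representation, and the choice of a half-open range [start, end) are assumptions; the text above does not say whether the range endpoints are inclusive. Timestamps are compared as fixed-width UTC ISO 8601 strings, which sort in chronological order.

```python
def select_subset(items, start, end, limit):
    """Pick at most `limit` of the newest items modified in [start, end).

    Returns (subset, is_partial). is_partial is True when items in the
    range had to be left out, which is the case that calls for the
    still-undetermined 2xx response code rather than 200 OK.
    """
    in_range = [item for item in items if start <= item["modified"] < end]
    # Newest first, so that truncating keeps the most recently modified items.
    in_range.sort(key=lambda item: item["modified"], reverse=True)
    subset = in_range[:limit]
    return subset, len(subset) < len(in_range)


# Sample data: a few of the items from the example later in this Pace.
ITEMS = [
    {"title": "Planes", "modified": "2004-10-30T12:01:00Z"},
    {"title": "Trains", "modified": "2004-10-15T23:45:00Z"},
    {"title": "Automobiles", "modified": "2004-10-28T09:18:00Z"},
    {"title": "Motorboats", "modified": "2004-10-10T13:00:00Z"},
    {"title": "Tricycles", "modified": "2004-10-01T08:47:00Z"},
]

subset, is_partial = select_subset(
    ITEMS, "2004-09-01T16:15:00Z", "2004-10-31T19:29:00Z", limit=3
)
```

With this data and a limit of three, the subset holds Planes, Automobiles, and Trains, and is_partial is True because two older items remain in the range.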

Design Rationale

One useful client algorithm uses the Modified-Range header to perform efficient synchronization. The algorithm is given, as input, a time range over which to sync, which may be (-inf, inf). It takes the following steps:

  1. Request the collection over the given range; add or update the resulting items in the client's local copy of the collection.

  2. If the response was 200 OK, finish.

  3. Otherwise, find the oldest atom:modified value in the result; call this "oldest".

  4. Repeat from step 1, but form a new time range as the intersection of the current range with (-inf, oldest). [In other words, shorten the range to exclude anything newer than "oldest".]
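
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: `fetch` stands in for the HTTP request and is assumed to return the items plus a flag that is False for 200 OK and True for the partial-result code. The fake server, the item ids, and the half-open range [start, end) are likewise hypothetical. Timestamps are fixed-width UTC ISO 8601 strings, which compare chronologically.

```python
def sync(fetch, start, end, local):
    """Sync `local` over [start, end); return the number of requests made."""
    requests_made = 0
    while True:
        # Step 1: request the range and merge the results into the local copy.
        items, is_partial = fetch(start, end)
        requests_made += 1
        for item in items:
            local[item["id"]] = item
        # Step 2: a complete (200 OK) response ends the sync. The empty
        # check guards against a server that reports partial with no items.
        if not is_partial or not items:
            return requests_made
        # Step 3: find the oldest modification time in this response.
        oldest = min(item["modified"] for item in items)
        # Step 4: shorten the range to exclude anything at or after "oldest".
        end = min(end, oldest)


def make_fake_server(collection, limit):
    """Build a hypothetical fetch() that returns at most `limit` newest items."""
    def fetch(start, end):
        in_range = sorted(
            (item for item in collection if start <= item["modified"] < end),
            key=lambda item: item["modified"],
            reverse=True,
        )
        return in_range[:limit], len(in_range) > limit
    return fetch


collection = [
    {"id": "planes", "modified": "2004-10-30T12:01:00Z"},
    {"id": "trains", "modified": "2004-10-15T23:45:00Z"},
    {"id": "automobiles", "modified": "2004-10-28T09:18:00Z"},
    {"id": "triremes", "modified": "2004-10-12T16:03:00Z"},
    {"id": "canoes", "modified": "2004-10-12T16:03:00Z"},
    {"id": "motorboats", "modified": "2004-10-10T13:00:00Z"},
    {"id": "bicycles", "modified": "2004-09-01T16:15:00Z"},
    {"id": "tricycles", "modified": "2004-10-01T08:47:00Z"},
]

local_copy = {}
request_count = sync(
    make_fake_server(collection, limit=3),
    "2004-09-01T16:15:00Z",
    "2004-10-31T19:29:00Z",
    local_copy,
)
```

Because the sketch treats the range end as exclusive, two items that share a timestamp could be split across a page boundary and the second one missed; a real implementation would need to settle how ties are handled.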

Example: In this scenario, the server will refuse to return more than three items at a time. Also assume that this client, Calvin (C), begins the sync operation at October 31, 2004, 19:29 UTC, and that the client's last sync operation was begun at September 1, 2004, 16:15:00 UTC. (Note that the beginning and end of a date range are separated by a slash.)

In this example, a custom syntax is used, but the protocol is independent of the XML representation of the collection; for example, the syntax of WebDAV's PROPFIND responses could be used. Likewise, PROPFIND could be used as the HTTP method without hampering the proposed protocol.

First request:
  GET /collections/categories HTTP/1.1
  Modified-Range: 2004-09-01T16:15:00Z/2004-10-31T19:29:00Z
  .....
  HTTP/1.1 2xx Most-Recent-Subsequence Result
  Content-type: text/xml; charset="utf-8"
  Content-Length: xxx
  
  <?xml version="1.0" encoding="utf-8" ?>
  <collection xmlns="http://purl.org/atom/ns">
    <subject value="...."><title>Planes</title>
      <modified>2004-10-30T12:01:00Z</modified></subject>
    <subject value="...."><title>Trains</title>
      <modified>2004-10-15T23:45:00Z</modified></subject>
    <subject value="...."><title>Automobiles</title>
      <modified>2004-10-28T09:18:00Z</modified></subject>
  </collection>
Second request:
  GET /collections/categories HTTP/1.1
  Modified-Range: 2004-09-01T16:15:00Z/2004-10-15T23:45:00Z
  ......
  HTTP/1.1 2xx Most-Recent-Subsequence Result
  Content-type: text/xml; charset="utf-8"
  Content-Length: xxx
  
  <?xml version="1.0" encoding="utf-8" ?>
  <collection xmlns="http://purl.org/atom/ns">
    <subject value="...."><title>Triremes</title>
      <modified>2004-10-12T16:03:00Z</modified></subject>
    <subject value="...."><title>Canoes</title>
      <modified>2004-10-12T16:03:00Z</modified></subject>
    <subject value="...."><title>Motorboats</title>
      <modified>2004-10-10T13:00:00Z</modified></subject>
  </collection>
Third request:
  GET /collections/categories HTTP/1.1
  Modified-Range: 2004-09-01T16:15:00Z/2004-10-10T13:00:00Z
  ......
  HTTP/1.1 200 OK
  Content-type: text/xml; charset="utf-8"
  Content-Length: xxx
  
  <?xml version="1.0" encoding="utf-8" ?>
  <collection xmlns="http://purl.org/atom/ns">
    <subject value="...."><title>Bicycles</title>
      <modified>2004-09-01T16:15:00Z</modified></subject>
    <subject value="...."><title>Tricycles</title>
      <modified>2004-10-01T08:47:00Z</modified></subject>
  </collection>

In this example the client needed three requests to retrieve every changed item in the collection "/collections/categories", because the server was stingy and would return at most three items per request.

Let's look at how this approach fares in the face of concurrent reads and writes. Suppose this whole cycle took a full 60 seconds to complete, from October 31, 2004, 19:29:00 to October 31, 2004, 19:30:00. Also suppose that at October 31, 2004, 19:29:15, another client, Hobbes, modified one item, an already-existing subject called "Go-carts" with identifier "/collections/categories/gocarts". As you can see above, this item was not returned by the server, since its modification date did not fall within any of the ranges at the time the request was made. Thus, at the end of the sync operation, Calvin will not have seen the change to "/collections/categories/gocarts", and the data which Calvin will display for that identifier will be the data that was fetched during the previous sync operation (the one begun 2004-09-01T16:15:00Z).

A second sync operation on the part of Calvin would normally pick up the lost item, "/collections/categories/gocarts". It is conceivable, though unlikely, that a sequence of sync operations could consistently omit some item. If the item is updated during every one of Calvin's sync operations, that item will never appear in any response.

However, the algorithm is robust against duplicates: using the above algorithm, no item will appear twice during a single sync operation, nor will an item be transmitted if it hasn't changed since Calvin's previous sync.
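
The Calvin and Hobbes scenario can be replayed in a small Python simulation. Everything here is a sketch under assumptions: the in-memory collection, the item ids other than the Go-carts identifier, the earlier modification time given to Go-carts, the 20:00:00 start of the second sync, and the half-open range [start, end) are all hypothetical. Timestamps are fixed-width UTC ISO 8601 strings, which compare chronologically.

```python
def run_sync(collection, start, end, limit, before_request=None):
    """Run one sync over [start, end); return the ids transmitted, in order."""
    transmitted = []
    request_number = 0
    while True:
        request_number += 1
        if before_request is not None:
            before_request(request_number)  # hook for a concurrent writer
        in_range = sorted(
            (item for item in collection.values()
             if start <= item["modified"] < end),
            key=lambda item: item["modified"],
            reverse=True,
        )
        page = in_range[:limit]
        transmitted.extend(item["id"] for item in page)
        if len(in_range) <= limit or not page:
            return transmitted
        end = min(end, min(item["modified"] for item in page))


BASE = "/collections/categories/"
collection = {
    BASE + name: {"id": BASE + name, "modified": modified}
    for name, modified in [
        ("planes", "2004-10-30T12:01:00Z"),
        ("trains", "2004-10-15T23:45:00Z"),
        ("automobiles", "2004-10-28T09:18:00Z"),
        ("triremes", "2004-10-12T16:03:00Z"),
        ("canoes", "2004-10-12T16:03:00Z"),
        ("motorboats", "2004-10-10T13:00:00Z"),
        ("bicycles", "2004-09-01T16:15:00Z"),
        ("tricycles", "2004-10-01T08:47:00Z"),
        ("gocarts", "2004-08-20T10:00:00Z"),  # assumed earlier modification
    ]
}


def hobbes(request_number):
    """Modify Go-carts while Calvin's sync is in progress."""
    if request_number == 2:
        collection[BASE + "gocarts"]["modified"] = "2004-10-31T19:29:15Z"


# Calvin's sync, begun at 19:29:00, with Hobbes writing part-way through.
first = run_sync(collection, "2004-09-01T16:15:00Z",
                 "2004-10-31T19:29:00Z", 3, hobbes)
# Calvin's next sync, assumed to begin at 20:00:00.
second = run_sync(collection, "2004-10-31T19:29:00Z",
                  "2004-10-31T20:00:00Z", 3)
```

In this run the first sync transmits eight distinct ids and omits Go-carts, while the second sync transmits only Go-carts, which matches the behavior described above.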

Prior Art

The principle of fetching recent items has precedent in this space.

The MetaWeblog API offers a getRecentPosts method as its sole means of discovering posts. This operation always returns the number of posts requested by the client, and they are always the most recent posts.

For a client that's syncing its state with a server, the question is "How many (or which) items should I fetch? Which ones have changed?" To that end, what would be most useful is an ability to request entries within a specific range of modification times.

Notes

This proposal is (hopefully) independent of the syntax for representing (e.g.) categories. The custom syntax above is merely for demonstration.

(flesh out)

