Tests I’d Like CouchDB to Pass

2007-09-24T18:20:05Z

First an update on Basura, as reported by the couch_tests.js:

Tests passing:

Testing basic functionality
Testing conflicts
Testing lots of docs
Testing multiple rows
Testing utf8

Tests yet to pass (or, in some cases, even be attempted):

Testing design documents
Testing attachments
Testing view collation

So, Basura, other than being a piece of trash, is starting to get functional. While it doesn’t yet pass all the CouchDB tests, it does pass some tests that I’d like to see CouchDB pass. These tests are the subject of this post.

Content Encoding Tests

This is real low-hanging fruit. JSON can be a mildly verbose text-based format, and is therefore amenable to standard compression techniques. In the HTTP GET case, this can be implemented without affecting any existing clients. The way it works is that the user agent indicates, via an Accept-Encoding header, which compression techniques it supports, and the server sends back a Content-Encoding header with its response indicating which compression method it chose. In the default case, neither sends such a header, and the data is sent back uncompressed.
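To make the exchange concrete, here is a minimal sketch of the server side in Python. The function name encode_body is made up for illustration and is not taken from Basura or CouchDB; it also ignores q-values in Accept-Encoding, which a real server should honor.

```python
import gzip
import json


def encode_body(doc, accept_encoding=None):
    """Return (headers, body) for a JSON document, gzipped if the client allows it."""
    body = json.dumps(doc).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    # Split "gzip, deflate;q=0.5" into bare coding names, dropping parameters.
    # Assumption: q-values are ignored here for brevity.
    offered = [part.split(";")[0].strip().lower()
               for part in (accept_encoding or "").split(",")]
    if "gzip" in offered:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    # With no Accept-Encoding header the body goes out uncompressed and
    # no Content-Encoding header is added: the default case.
    return headers, body
```

A client that never sends Accept-Encoding sees exactly what it saw before, which is why existing clients are unaffected.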

As an indication of how transparent and widespread this is, XMLHttpRequest on Firefox handles this automatically, which means that the tests in couch_tests.html now run compressed. Yea!

Neither the server support nor the unit tests for compression on PUT and POST are implemented yet. As this involves compression of the requests themselves, it is a bit presumptuous for clients to assume that the server supports the compression technique chosen, so this technique is less often used in the wild. Still, this should ultimately be implemented as there likely will be some clients that would trade off a bit of coupling for bandwidth savings.
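For what the request side might look like once it exists, here is a hedged sketch of how a server could unpack a compressed PUT or POST body. The helper decode_request_body is hypothetical, and supporting only gzip is an assumption. Answering 415 for an unknown coding is what HTTP/1.1 suggests for this situation.

```python
import gzip


def decode_request_body(content_encoding, body):
    """Return (status, body); status is 415 when the content-coding is unsupported."""
    coding = (content_encoding or "identity").strip().lower()
    if coding == "identity":
        # No Content-Encoding header: the body is already plain.
        return 200, body
    if coding == "gzip":
        # Note: malformed gzip data raises an exception here; a real
        # server would turn that into a 400 response.
        return 200, gzip.decompress(body)
    # The client guessed a coding this server does not understand.
    return 415, b""
```

The 415 branch is the coupling cost mentioned above: the client has to be prepared to retry uncompressed.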

Content Negotiation Tests

This feature is of less value, but it takes virtually nothing to implement.

HTTP, as a protocol, is not based on the presumption that clients and servers must be layered similarly in order to meaningfully interact. To the contrary, any given server should be able to interact with a wide range of user agents.

A generic application, like a browser, may not understand a given data format. Firefox, for example, currently handles application/xml automatically, but does not handle application/json. For such an application, sending responses as text/plain may help with debugging.

A CouchDB specific library already “knows” what data format it is expecting, and needs no additional hints. For such applications, the Content-Type may not matter much.

A generic URI library may be able to provide additional value if it knows the content type it is dealing with. For example, with JSON, it could pre-load the data.

The way this is supposed to work is similar to the Accept-Encoding header above: an Accept header is sent indicating what content types are supported and preferred, and the server can use this information to adjust its response.
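A rough sketch of that negotiation follows. The names parse_accept and choose_content_type are invented for this example, and the choice between application/json and text/plain is an assumption based on the two cases discussed above. When nothing matches it falls back to the default rather than answering 406, which is a simplification.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so ties keep the order the client listed them in.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_content_type(accept, available=("application/json", "text/plain")):
    """Pick a response type; the first entry of `available` is the default."""
    if not accept:
        return available[0]
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if media_type == "*/*":
            return available[0]
        for candidate in available:
            if media_type == candidate or (
                media_type.endswith("/*")
                and candidate.startswith(media_type[:-1])
            ):
                return candidate
    # Nothing matched; a stricter server could answer 406 instead.
    return available[0]
```

A CouchDB-specific library that sends Accept: application/json, or no Accept header at all, gets JSON as before; a client that ranks text/plain above */* gets something it can display.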

Again, the sole purpose of such a feature is to enable serendipity. If/when a browser comes out that supports JSON to the same level that modern browsers support XML, Basura will be ready. Additionally, Basura specific libraries (ha!) could use this information as a sanity check of responses before attempting to JSON decode them. Yes, some mis-configured servers have been known to send back 200 OK responses with associated text that contradicts this status. Checking the Content-Type of the response doesn’t absolutely guarantee anything, but it does tend to increase the number of times that a library can produce a more meaningful error message.
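The sanity check could be as small as the following sketch. The function decode_json_response is hypothetical and stands in for whatever a client library would do after receiving a response.

```python
import json


def decode_json_response(status, content_type, body):
    """Decode a JSON response body, failing early with a readable message."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "application/json":
        preview = body[:60].decode("utf-8", errors="replace")
        raise ValueError(
            "expected application/json but got %r (status %s): %s"
            % (media_type, status, preview)
        )
    return json.loads(body.decode("utf-8"))
```

Without the check, an HTML error page served with 200 OK would surface as an obscure JSON parse error instead of a message that names the real problem.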

Conditional Processing Tests

This feature can potentially produce the biggest benefit, though it will require a bit of a change to the existing CouchDB interfaces. Hopefully it is early enough in the development that these changes can be considered.

As with everything else HTTP related, it involves exposing a bit more metadata in HTTP headers.

For starters, there is the ETag, a string that is guaranteed to change every time the resource does. CouchDB revision IDs are perfect for this. As an aside, clients are discouraged from making any assumptions about the values of ETags. For this reason, a case could be made for “lightly encoding” (say, ROT13 or equivalent?) these values. On the other hand, exposing the revision IDs in this manner makes debugging a bit easier.

Secondly, there is a Last-Modified header that can be used for basically this same purpose. Given its granularity of a second, it is inferior to ETags, but as some clients are happy enough with it, servers should support it.

These two values can be combined with a variety of headers to essentially assert a number of preconditions: If-Match, If-Modified-Since, If-None-Match, and If-Unmodified-Since.

The big benefit that is available to all is that GET can return 304 Not Modified. This can significantly reduce bandwidth on resources that are mercilessly polled. Furthermore, if the ETag value is chosen so that its currency can be determined entirely by the index, both CPU and memory usage can be reduced.
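Here is a sketch of that short circuit, assuming the current revision ID can be read from the index without touching the document itself. The function conditional_get and its load_body callback are made up for illustration, and weak ETags (the W/ prefix) are not handled.

```python
def conditional_get(current_rev, load_body, if_none_match=None):
    """Return (status, headers, body) for a GET.

    current_rev: the revision ID, assumed to be cheap to look up.
    load_body:   a callable that fetches and serializes the document;
                 it is only called when a full response is needed.
    """
    etag = '"%s"' % current_rev  # ETags are quoted strings
    headers = {"ETag": etag}
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            # The client's copy is current: no body, no document load.
            return 304, headers, b""
    return 200, headers, load_body()
```

The 304 path never calls load_body, which is where the CPU and memory savings come from on top of the bandwidth savings.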

A secondary benefit will accrue only to some: by adhering to standards, people who build on libraries that implement those standards will have less code to write. As code that deals with concurrency issues tends to be error prone, this can be a big win.
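The concurrency case in question is the lost update, which If-Match is designed to prevent. The toy in-memory Store below is purely illustrative: the class, its integer revision numbers, and the choice of 201 and 204 as success codes are assumptions, not a description of CouchDB or Basura.

```python
class Store:
    """A toy in-memory document store that numbers revisions 1, 2, 3, ..."""

    def __init__(self):
        self.docs = {}  # doc id -> (revision number, body)

    def get(self, doc_id):
        rev, body = self.docs[doc_id]
        return 200, {"ETag": '"%d"' % rev}, body

    def put(self, doc_id, body, if_match=None):
        current = self.docs.get(doc_id)
        if current is None:
            if if_match is not None:
                # If-Match can never succeed on a missing resource.
                return 412, {}
            self.docs[doc_id] = (1, body)
            return 201, {"ETag": '"1"'}
        etag = '"%d"' % current[0]
        if if_match is not None and if_match not in ("*", etag):
            # Someone else updated the document first: Precondition Failed.
            return 412, {"ETag": etag}
        rev = current[0] + 1
        self.docs[doc_id] = (rev, body)
        return 204, {"ETag": '"%d"' % rev}


store = Store()
store.put("doc", b"v1")
_, headers_a, _ = store.get("doc")  # client A reads revision 1
_, headers_b, _ = store.get("doc")  # client B reads revision 1 too
status_a, _ = store.put("doc", b"from A", if_match=headers_a["ETag"])
status_b, _ = store.put("doc", b"from B", if_match=headers_b["ETag"])
```

Client A's write succeeds and client B's is refused with 412, so B has to re-fetch and merge rather than silently overwriting A's change.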

lostupdate.py is an example using httplib2.