intertwingly

It’s just data

Dare Takes a Look at CouchDB


Dare Obasanjo: Recently I took a look at CouchDB because I saw it favorably mentioned by Sam Ruby and when Sam says some technology is interesting, he’s always right

Dare’s review of CouchDB is worth a read.  (Update: so are Assaf Arkin's and Damien Katz's responses)  He gets more things right than wrong.  And he doesn’t get things wrong so much as he has a tendency to make unqualified statements that need to be qualified.  Like statements that things that are interesting to me tend to be interesting to Dare (but to my kids?  Not so much).  Another example:

One thing that not so interesting is that editing documents is lockless and utilizes optimistic concurrency which means more work for clients.

That’s definitely a statement that requires qualification.  Perhaps one like this one:

Document oriented database work well for semi-structured data where each item is mostly independent and is often processed or retrieved in isolation.

While that is a good qualification, it errs on the side of being a bit too restrictive, particularly when Dare follows up with:

However there are also lots of Web applications that are about managing heavily structured, highly interrelated data (e.g. sites that heavily utilize tagging or social networking) where the document-centric model doesn’t quite fit.

Prior to the web, most hypertext theory centered around bidirectional links and fixed schemas.  Approaches that wouldn’t scale and don’t easily evolve.  By contrast, the web is made up of sites that independently update so that Dare can post to his web site without requiring anything like a lock that would affect me posting to mine.  Both of our sites enable comments, so there is a limited ability for others to post things, but this tends to operate at such a low rate (dozens of updates per day) that optimistic concurrency isn’t much of an issue.
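To make the "more work for clients" point concrete, here is a minimal sketch of optimistic concurrency. It uses a toy in-memory store, not CouchDB's actual API: `DocumentStore`, `Conflict`, `add_comment`, and the integer revision numbers are all names I made up for illustration. The idea is simply that a write must name the revision it was based on, and a client that loses the race re-reads and tries again.

```python
class Conflict(Exception):
    """Raised when a write is based on a stale revision."""


class DocumentStore:
    """Toy in-memory store (hypothetical, for illustration only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, body, expected_rev=None):
        # No lock is taken; the write succeeds only if nobody else
        # has updated the document since the caller read it.
        current_rev = self._docs.get(doc_id, (None, None))[0]
        if current_rev != expected_rev:
            raise Conflict(doc_id)
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, body)
        return new_rev


def add_comment(store, doc_id, comment, max_attempts=5):
    """The client-side work: read, modify, write, retry on conflict."""
    for _ in range(max_attempts):
        rev, body = store.get(doc_id)
        updated = {**body, "comments": body["comments"] + [comment]}
        try:
            return store.put(doc_id, updated, expected_rev=rev)
        except Conflict:
            continue  # someone else won the race; re-read and retry
    raise Conflict(doc_id)


store = DocumentStore()
store.put("post", {"comments": []})           # creates revision 1
stale_rev, stale_body = store.get("post")     # a slow writer reads revision 1
add_comment(store, "post", "first!")          # a fast writer creates revision 2
try:
    store.put("post", stale_body, expected_rev=stale_rev)
    conflicted = False
except Conflict:
    conflicted = True                         # the slow writer must retry
```

At dozens of updates per day, that retry loop will almost never run more than once.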

And yet search engines like Google and approaches like map/reduce show that such sites can be reasonably indexed.

Concrete example, from the social networking space.  My Facebook profile could be viewed as a document.  One with one primary author and limited abilities for others to modify it.  And yet things like the News Feed could easily be produced by a map/reduce job.  In parallel.  Across a large cluster of commodity machines and in a highly scalable manner.

To get a perspective on why this is important, consider that I started looking at this from the other side.  What happens when your application grows so large that you have no choice but to massively employ techniques like sharding?  What do you have to give up?  What do you need to add back in to mitigate the loss of the things you give up?

At a certain point, referential integrity has to be given up.  Scale a bit further, and even the notion of a relation in the relational database sense of the word starts to break down.  To cope, you denormalize a bit, not so much for performance reasons (though that’s important too), but as a self-defense mechanism so that the pieces of data that you do have carry enough context to be meaningful.

What replaces a Department table in a typical Company/Employees database (or one that identifies a Group in a Facebook-like application) when faced with the prospects of mega-sharding?  The CouchDB answer is views, ones that are computed by map/reduce jobs that essentially extract (or map) “tags” and “social relations” from profile documents and reduce them into documents in their own right.
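A rough sketch of that idea, in plain Python rather than in CouchDB itself: the profile documents, the field names, and the helpers `map_profile`, `reduce_members`, and `build_view` are all assumptions of mine for illustration. The map step emits a key for each department and tag found in a profile; the reduce step collapses everything emitted under one key into a document of its own.

```python
from collections import defaultdict

# Hypothetical profile documents; note there is no Department table anywhere.
profiles = [
    {"_id": "alice", "department": "Engineering", "tags": ["couchdb", "erlang"]},
    {"_id": "bob", "department": "Engineering", "tags": ["couchdb"]},
    {"_id": "carol", "department": "Sales", "tags": ["rest"]},
]


def map_profile(doc):
    """Map: emit (key, value) pairs extracted from a single document."""
    yield ("department", doc["department"]), doc["_id"]
    for tag in doc.get("tags", []):
        yield ("tag", tag), doc["_id"]


def reduce_members(key, values):
    """Reduce: turn all values emitted for one key into a new document."""
    return {"key": list(key), "members": sorted(values)}


def build_view(docs, map_fun, reduce_fun):
    """Run the map over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:  # each document is mapped independently of the others
        for key, value in map_fun(doc):
            grouped[key].append(value)
    return [reduce_fun(key, values) for key, values in sorted(grouped.items())]


view = build_view(profiles, map_profile, reduce_members)
```

Because each profile is mapped in isolation, the map step can be spread across as many shards as you like; only the grouping needs to bring results together.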

This leads to:

although focusing on JSON instead of XML makes it buzzword compliant

Dare, you say this like it was a bad thing :-).  Do you really want to continue to program with Circles, Triangles, and Rectangles?  Or would you rather your program looks something like this?

And then to:

and is definitely not a replacement/evolution of relational databases

What I have come to realize is that the very things that make J2EE and Relational Databases suitable for Enterprise scale applications are the very things that act as road bumps on workgroup scale and on web scale applications.  Simply put, relational databases will get squeezed on both sides.

Footnote: as I was writing this, I saw Chuck Vose's take.  His first bet takes some of my thinking to its logical conclusion, but he doesn’t yet see what I see in CouchDB.  Perhaps this post will help shed some light on why I think CouchDB is in line with my other bets.  His second bet goes off the rails [heh] a bit with:

And I realize that this is all possible in the REST model, but it makes the controllers obscene sometimes.

I’d like to put forward another possibility.  While DHH is enamored of REST (and I deserve a small bit of the “blame” for that), his views on “stored procedures” are widely known (search for “Choose a single layer of cleverness”).  Perhaps the map/reduce abstraction might just cause him to give a little on the latter in order to maintain the former.

A closing thought: couch.ini talks about a “JsServer”, but in reality any language that can evaluate a view, read from stdin, write to stdout, and produce and consume JSON could be used.
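To illustrate, here is a minimal sketch of what such a server might look like in Python. The line-per-message framing and the command names (`reset`, `add_fun`, `map_doc`) are assumptions made for this example and should not be read as the exact wire protocol CouchDB speaks; the point is only how little is required: evaluate a view, read JSON, write JSON.

```python
import io
import json


def serve(stdin, stdout):
    """Read one JSON command per line; write one JSON reply per line."""
    map_funs = []
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        command, *args = json.loads(line)
        if command == "reset":
            map_funs.clear()
            reply = True
        elif command == "add_fun":
            # "Evaluate a view": the source here is a Python lambda that
            # returns [key, value] pairs for a document. Emptying the
            # builtins is NOT a real sandbox; it only keeps the demo tidy.
            map_funs.append(eval(args[0], {"__builtins__": {}}))
            reply = True
        elif command == "map_doc":
            reply = [fun(args[0]) for fun in map_funs]
        else:
            reply = {"error": "unknown_command", "reason": command}
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()


# Drive the server with in-memory streams in place of real pipes.
view_source = 'lambda doc: [[tag, doc["name"]] for tag in doc.get("tags", [])]'
commands = [
    ["reset"],
    ["add_fun", view_source],
    ["map_doc", {"name": "sam", "tags": ["couchdb", "rest"]}],
]
requests = io.StringIO("".join(json.dumps(c) + "\n" for c in commands))
replies = io.StringIO()
serve(requests, replies)
output = [json.loads(line) for line in replies.getvalue().splitlines()]
```

Calling `serve(sys.stdin, sys.stdout)` instead (after `import sys`) would turn the same function into a process that a database could launch and talk to over pipes.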