It’s just data

JSON for Map/Reduce

James Snell: Abdera has always included the ability to serialize Atom entries to JSON. The mapping, however, was not all that ideal. So I rewrote it. The new serialization is VERY verbose but covers extensions, provides better handling of XHTML content, etc. I ran my initial try by Sam Ruby who offered some suggested refinements and I made some changes. The new output is demonstrated here (a json serialization of Sam Ruby’s blog feed). The formatting is very rough, which I’ll be working to fix up, but you should be able to get the basic idea.

Based on the comments, Patrick and Elias do not seem amused.  Guys, I’ve got a use case in mind; would you mind helping me with it?

Imagine I have a database designed from the ground up for JSON.  One where incremental map/reduce jobs replace queries.  The data I plan to put in that database is from feeds: RSS 1.0, RSS 2.0, Atom 0.3, whatever; I don’t care.  With the components that go into Venus (UFP, HTML5LIB, and reconstitute) I can do a LOT of normalization.  Which is good, because I’d like to do all the normalization I can once, so that the subsequent map/reduce tasks can focus more on the problem they are trying to solve and less on the syntax.

The map/reduce jobs will typically be written in JavaScript.  By that I mean what you get when you apt-get install spidermonkey-bin and run from the command line, not what you get when you run within Firefox.  Other languages could be substituted, if a strong enough case were made.

The set of potential microformats is unbounded.  I’d like to be ready to handle microformats that haven’t been invented yet.  But to provide some specifics for this use case, let’s consider hCalendar.  It contains dates and locations.
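To make that concrete, a pre-normalized entry might carry its extracted hCalendar data along these lines.  This is purely illustrative; the property names (in particular "events") are my own placeholders, not a proposal for the actual serialization:

```json
{
  "id": "tag:example.org,2007:entry-1",
  "title": "BarCamp Raleigh",
  "content": "...the original (x)html content, preserved as-is...",
  "events": [
    {
      "dtstart": "2007-05-19T09:00:00-05:00",
      "dtend": "2007-05-19T18:00:00-05:00",
      "location": "Red Hat HQ, Raleigh, NC"
    }
  ]
}
```

The point being: the expensive microformat parsing happens once, up front, and the map tasks only ever see clean JSON.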

First, do all three of you agree that this is a reasonable use case?  If so, what would the ideal JSON format be for this case?  Remember, I’m willing to throw virtually unlimited resources at the one-time pre-normalization step in the hopes that such efforts can help shave microseconds off of subsequent map tasks.

Again, to keep this grounded, try to sketch out the code for the map task.  Input is a key and a single JSON document, output is an array of [[dtstart, dtend, location], key].  I don’t care if the locations were originally in summary, content, description, content:encoded, or even title elements, I simply want the data.
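To show the shape of thing I have in mind, here is a minimal sketch of such a map task.  It assumes the illustrative pre-normalization above has already pulled each event’s dtstart, dtend, and location out of wherever they originally lived (summary, content, title, …) into a hypothetical "events" array on the document; the function and property names are mine, not part of any actual format:

```javascript
// Map task sketch: given a key and one pre-normalized JSON document,
// return an array of [[dtstart, dtend, location], key] pairs, one per
// hCalendar event found in the document.
function mapEvents(key, doc) {
  var results = [];
  var events = doc.events || [];   // "events" is an assumed property
  for (var i = 0; i < events.length; i++) {
    var ev = events[i];
    results.push([[ev.dtstart, ev.dtend, ev.location], key]);
  }
  return results;
}

// Example:
var sample = {
  events: [{ dtstart: "2007-05-19T09:00", dtend: "2007-05-19T18:00",
             location: "Raleigh, NC" }]
};
var rows = mapEvents("tag:example.org,2007:entry-1", sample);
// rows[0] → [["2007-05-19T09:00", "2007-05-19T18:00", "Raleigh, NC"],
//            "tag:example.org,2007:entry-1"]
```

Note how little the map task does: no regexes, no DOM walking, no guessing about which element held the location.  That is the payoff I’m after from the pre-normalization step.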