It’s just data

Publishing a Blog From a mod_atom Store

Seth Gordon: Planet was designed to crawl all the feeds on the blogroll and produce some appropriately formatted HTML page with all their contents; you could just set it up so it only read your own blog’s mod_atom feed, make some appropriate template, and voila!

That would certainly cover the front page, but that’s about it.

Fortunately, there are bits and pieces that cover the rest.  I’ve contributed heavily to Planet, the Universal Feed Parser, and html5lib, and maintain what effectively is the only active development branch of Planet at this point, which I call Venus.  As Venus has been refactored, it is easier to discuss this in terms of Venus’s architecture than of Planet’s.

Venus has been split into two phases, Spider which fetches the data, and Splice which selects and formats entries.  They communicate by means of an Atom store.  Let’s look at each in turn.

The output of all this is placed on disk, one file per entry.  At this point, it is worth considering the internal data format of Tim’s mod_atom, where all data is placed on disk, one file per entry.  Hmmm...  Atom Store!

The bazillion feed formats issue is a non-issue here, as are the eight ways to specify an author name and the seemingly endless creative ways in which people misuse RFC 822 formatted dates; all that remains unaddressed is the cleansing of the HTML.  In terms of this diagram, that simply means that html5lib needs to shift from the left to the right, and Spider is no longer necessary.

Now, let’s look at that right hand side.  Splice is brain dead simple.  It reads a set of entries, concatenates them into a feed, and then sends that feed to the template engine of your choice.
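To make "reads a set of entries, concatenates them into a feed" concrete, here is a minimal sketch in Python using only the standard library.  It is not Venus’s actual Splice code.  The function name `splice`, the `.atom` file extension, and the directory layout are all assumptions made for illustration; mod_atom’s real on-disk naming may differ.

```python
import glob
import os
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace


def splice(entry_dir, title="Example feed", limit=20):
    """Read one-entry-per-file Atom documents and wrap them in a feed.

    Assumption: every entry lives in `entry_dir` as `<something>.atom`.
    """
    entries = []
    for path in glob.glob(os.path.join(entry_dir, "*.atom")):
        root = ET.parse(path).getroot()
        if root.tag == "{%s}entry" % ATOM:  # skip anything that is not an entry
            entries.append(root)

    # Newest first.  Comparing atom:updated as strings is only correct when
    # all timestamps use the same UTC offset (e.g. all end in "Z").
    entries.sort(key=lambda e: e.findtext("{%s}updated" % ATOM, ""),
                 reverse=True)

    feed = ET.Element("{%s}feed" % ATOM)
    ET.SubElement(feed, "{%s}title" % ATOM).text = title
    feed.extend(entries[:limit])
    return feed
```

The returned element can be serialized with `ET.tostring(feed, encoding="unicode")` or handed directly to whatever template engine you prefer.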

It actually is simple enough that I don’t believe there will be any code worth reusing.  If you are producing your web site dynamically, you need a controller that parses the URI to determine which file(s) to read off of disk, parses those files (an XML parser will do just fine here), sanitizes the HTML (again, all you need is in html5lib), resolves relative URIs, and then passes the output through a template of your choice.
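Two of those controller steps are easy to sketch with nothing but the Python standard library: mapping a request URI onto a file in the store, and resolving relative URIs.  The names `entry_path` and `resolve_links`, the `.atom` extension, and the URI-to-path scheme are hypothetical, chosen only for illustration.  Sanitizing is deliberately left out here, since that is html5lib’s job, and only `href` attributes on `atom:link` elements are resolved; links inside the entry content and `xml:base` would need the same treatment.

```python
import os
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlsplit

ATOM = "http://www.w3.org/2005/Atom"


def entry_path(uri, store):
    """Map a request URI onto one entry file under `store`.

    Assumed layout: /2007/06/hello  ->  <store>/2007/06/hello.atom
    """
    path = posixpath.normpath(urlsplit(uri).path).lstrip("/")
    root = os.path.abspath(store)
    candidate = os.path.abspath(os.path.join(root, path + ".atom"))
    # Refuse anything that would climb out of the store directory.
    if not candidate.startswith(root + os.sep):
        raise ValueError("URI escapes the store: %r" % uri)
    return candidate


def resolve_links(entry, base):
    """Rewrite relative href attributes on atom:link elements against `base`."""
    for link in entry.iter("{%s}link" % ATOM):
        href = link.get("href")
        if href is not None:
            link.set("href", urljoin(base, href))
    return entry
```

A dynamic controller would call `entry_path` on the incoming request, parse the resulting file with `ET.parse`, run `resolve_links` with the entry’s public URI as the base, and pass the result on to the sanitizer and template.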

If you are generating your website statically, you do basically the same thing, but place the output on disk instead.

Oh, and did I mention that html5lib was available in two languages: Python and Ruby?

But enough with hand-waving.  Time for some real code.  Download this.  Tailor two lines.  And then:

eruby atompub.rhtml

Joe can port it to Python in 10 minutes.  Steve to JavaScript in 20 hours or so.  Prefer Java?  C#?  Perl?  Go for it!