Publishing a Blog From a mod_atom Store
Seth Gordon: Planet (http://www.planetplanet.org/) was designed to crawl all the feeds on the blogroll and produce some appropriately formatted HTML page with all their contents; you could just set it up so it only read your own blog’s mod_atom feed, make some appropriate template, and voila!
That would certainly cover the front page, but that’s about it.
Fortunately, there are bits and pieces that cover the rest. I’ve contributed heavily to Planet, the Universal Feed Parser, and html5lib, and I maintain what is effectively the only active development branch of Planet at this point, which I call Venus. Since Venus has been refactored, it is easier to discuss this in terms of Venus’s architecture than Planet’s.
Venus is split into two phases: Spider, which fetches the data, and Splice, which selects and formats entries. They communicate by means of an Atom store. Let’s look at each in turn.
First, Spider fetches your feed. With the right options it will make use of httplib2, which in general I highly recommend; but in this scenario the data is already on disk, so it isn’t necessary.
The Universal Feed Parser accomplishes multiple things, most notably:
It handles multiple feed formats, date formats, and ill-formed feeds. Just one concrete example to illustrate the point: the author name can be obtained from one of eight different places in your feed, and the Universal Feed Parser even handles cases where people simply put names in place of email addresses, or tack on names as comments after an email address.
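To make that concrete, the name-vs-email disentangling can be approximated with the standard library’s email.utils. This is a toy sketch of the idea, not the Universal Feed Parser’s actual logic, and split_author is my own hypothetical helper:

```python
from email.utils import parseaddr

def split_author(raw):
    """Toy sketch (not the feed parser's real code): recover a name
    and/or address from the free-form author strings found in feeds."""
    name, addr = parseaddr(raw)
    if "@" not in addr:
        # No address at all: the whole string is just a name.
        return raw.strip(), None
    return (name or None), addr

print(split_author("John Doe <john@example.com>"))
print(split_author("john@example.com (John Doe)"))
print(split_author("John Doe"))
```

The real parser goes much further, of course, checking eight different feed elements before it ever gets to string-level heuristics like these.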
It partially cleans the HTML: it uses sgmllib to clean up the tokens, removes unsafe constructs (like plaintext), and resolves relative URIs.
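The relative-URI resolution, at least, is easy to picture; urljoin from the standard library does the equivalent job (the URLs here are made-up examples):

```python
from urllib.parse import urljoin

# A relative href from an entry, resolved against the entry's base URI.
base = "http://example.org/blog/"
print(urljoin(base, "images/pic.png"))  # http://example.org/blog/images/pic.png
print(urljoin(base, "/about"))          # http://example.org/about
```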
html5lib completes the HTML cleansing. Truth be told, it has a better tokenizer and a better sanitizer than the ones in the feed parser, but for the moment all Venus uses it for is parsing. The output of this phase is unfailingly well-formed.
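To give a feel for what “cleansing” means, here is a toy allowlist sanitizer built on the standard library’s html.parser. It is a sketch of the idea only; html5lib’s actual tokenizer and sanitizer are far more thorough, and the tag and attribute allowlists below are my own arbitrary choices:

```python
from html.parser import HTMLParser

SAFE_TAGS = {"p", "a", "em", "strong", "ul", "ol", "li", "pre", "code"}
SAFE_ATTRS = {"href", "title"}
DROP_CONTENT = {"script", "style"}  # drop these tags AND everything inside

class ToySanitizer(HTMLParser):
    """Allowlist sanitizer sketch (NOT html5lib): keep only known-safe
    tags and attributes; drop script/style along with their contents."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out, self.skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
        elif tag in SAFE_TAGS:
            kept = "".join(f' {k}="{v}"' for k, v in attrs if k in SAFE_ATTRS)
            self.out.append(f"<{tag}{kept}>")
    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif tag in SAFE_TAGS:
            self.out.append(f"</{tag}>")
    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)

def sanitize(html):
    p = ToySanitizer()
    p.feed(html)
    p.close()
    return "".join(p.out)

print(sanitize('<p onclick="evil()">hi <script>alert(1)</script><em>there</em></p>'))
```

The onclick attribute and the script element (content included) are dropped; the safe markup survives.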
Venus’s reconstitute module reconstructs an Atom element from the feed parser’s data.
The output of all this is placed on disk, one file per entry. At this point, it is worth considering the internal data format of Tim’s mod_atom, where all data is likewise placed on disk, one file per entry. Hmmm... an Atom Store!
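The one-file-per-entry idea fits in a few lines of code. This is a hedged sketch only: the filename scheme and the minimal entry markup below are my own guesses for illustration, not mod_atom’s actual on-disk format.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def store_entry(dirname, entry_id, title, content_html):
    """Write one Atom entry per file, keyed by the tail of the entry id.
    (Filename scheme and markup are illustrative guesses, not mod_atom's.)"""
    ET.register_namespace("", ATOM)
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}content", type="html").text = content_html
    path = os.path.join(dirname, entry_id.rsplit("/", 1)[-1] + ".atom")
    ET.ElementTree(entry).write(path, xml_declaration=True, encoding="utf-8")
    return path

store = tempfile.mkdtemp()
entry_path = store_entry(store, "tag:example.org,2007:/blog/1",
                         "Hello", "<p>hi</p>")
print(os.listdir(store))
```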
The bazillion feed formats issue is a non-issue here, nor are the eight ways to specify an author name, nor are the endlessly creative ways in which people misuse RFC 822 formatted dates; all that remains as an unaddressed issue is the cleansing of the HTML. In terms of this diagram, that simply means that html5lib needs to shift from the left to the right, and Spider is no longer necessary.
Now, let’s look at the right-hand side. Splice is brain-dead simple. It reads a set of entries, concatenates them into a feed, and then sends that feed to the template engine of your choice.
It actually is simple enough that I don’t believe there will be any code worth reusing. If you are producing your web site dynamically, you need a controller that parses the URI to determine which file(s) to read off of disk, parses those files (an XML parser will do just fine here), sanitizes the HTML (again, everything you need is in html5lib), resolves relative URIs, and then passes the output through a template of your choice.
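The read-concatenate-template pipeline really is that small. Here is a sketch in Python with the standard library only; the templates, filenames, and splice function are my own illustrative inventions, not Venus’s actual code, and the sanitizing and URI-resolution steps are deliberately left out:

```python
import glob
import os
import tempfile
from string import Template
from xml.etree import ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
PAGE = Template("<html><body>\n$items\n</body></html>")
ITEM = Template("<h2>$title</h2>\n$content")

def splice(store_dir):
    """Toy Splice: read every stored entry and pour the lot through a
    template.  (This is where you would sanitize with html5lib and
    resolve relative URIs; both omitted for brevity.)"""
    items = []
    for path in sorted(glob.glob(os.path.join(store_dir, "*.atom")),
                       reverse=True):
        root = ET.parse(path).getroot()
        items.append(ITEM.substitute(
            title=root.findtext(ATOM + "title", ""),
            content=root.findtext(ATOM + "content", "")))
    return PAGE.substitute(items="\n".join(items))

# Demo: one hand-written entry file in a throwaway store.
store_dir = tempfile.mkdtemp()
with open(os.path.join(store_dir, "1.atom"), "w") as f:
    f.write('<entry xmlns="http://www.w3.org/2005/Atom"><title>Hello</title>'
            '<content type="html">&lt;p&gt;hi&lt;/p&gt;</content></entry>')
print(splice(store_dir))
```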
If you are generating your website statically, you do basically the same thing, but place the output on disk instead.
Oh, and did I mention that html5lib is available in two languages: Python and Ruby?
But enough with hand-waving. Time for some real code. Check out this. Download this. Tailor two lines. And then:
eruby atompub.rhtml
Joe can port it to Python in 10 minutes. Steve to JavaScript in 20 hours or so. Prefer Java? C#? Perl? Go for it!