intertwingly

It’s just data

Yet Another Planet Refactoring


A little over a month ago, I outlined how I would like to see the feed parser reorganized.  I’ve now put a little meat on the bones, in the form of running code.  Not just for the feed parser, but also for Planet.  I also did it all in Ruby, so I named this little experiment Mars.  Warning: this version is 0.0.1.  It just barely runs end-to-end.  Feed it real data, and it will choke on some of it.  But it can now produce partial results.

Inventory:

config
To keep things compatible, I ported the parsing logic from Python’s ConfigParser to Ruby.  Once parsed, the results are accessed as a Hash.  Eventually, I’ll provide Planet-specific logic, like defaults, in this module.
fido
Fido fetches (get it?) feeds.  It caches, follows redirects, times out, handles HTTP status codes, and compression/zip.  It also distributes work across multiple threads.
xmlparser
This module will use one of four XML parsers, and return the result as a REXML document.  If installed, it will use the fast and standards-compliant expat or libxml2 parsers.  If neither is installed, it will use the slower and less compliant REXML (to this day, it still parses <a>&a</a> and <a a='<'/> without error).  All in all, REXML isn’t too bad... as long as you don’t depend on it for serialization or deserialization or XPath or expect quick turnaround on bug fixes or responses on their mailing list.  In the event the chosen parser fails to parse the document, the HTML5lib liberal XML parser will be used, and a bozo flag will be set on the document itself.
Transmogrify
With RSS, there often are several ways to express the same concept (shades of TMTOWTDI?).  Atom aspires to Python’s philosophy of There should be one—and preferably only one—obvious way to do it.  This module is clearly opinionated software in that it will transmogrify feeds which use less obvious constructs into more obvious ones.
sift
Sift will filter out impurities and break down HTML into elements that can be iterated over.
forgiving_uri
This is from Bob Aman.  I need to look into Addressable (also from Bob Aman) to see if it is a better alternative.
spider
Spider orchestrates the retrieval and sifting of feeds, breaks the results into a set of entries, adds in source information, and caches the results.
splice
Splice will select the latest entries and produce a feed, and then process this information using a user-supplied list of templates.  At the moment, the only templating language supported is XSLT.
style
If libxslt is installed, it will be used; otherwise an attempt is made to shell out to xsltproc.
harvest
Not yet integrated into Mars proper, but used to drive the testing (and therefore the development) to date, this function dynamically constructs dictionaries from an Atom document.  This will undoubtedly be useful for other templating languages.  As everything is constructed dynamically, multiple date formats, multiple serialization formats (think: dropping superfluous quotes on attributes and pesky things like explicit paragraph end tags for HTML5 purists), and multiple aliases for any given element are no problem.
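
To make the config item in the inventory concrete, here is a minimal sketch of what parsing a ConfigParser-style file into a Hash could look like.  This is not the Mars code: the method name parse_config, the sample section names, and the exact rules (comments starting with # or ;, = or : as delimiters, lowercased option names, indented continuation lines) are my assumptions, modeled on ConfigParser's default behavior.

```ruby
# Hypothetical helper, not taken from Mars: turn ConfigParser-style text
# into a Hash of section name => Hash of option => value.
def parse_config(text)
  config = {}
  section = nil   # Hash for the section currently being filled in
  key = nil       # last option seen, needed for continuation lines

  text.each_line do |raw|
    line = raw.chomp
    # Skip blank lines and full-line comments.
    next if line.strip.empty? || line =~ /\A\s*[#;]/

    if line =~ /\A\[([^\]]+)\]\s*\z/
      # Section header, e.g. [Planet] or [http://example.com/feed.atom]
      section = config[$1] ||= {}
      key = nil
    elsif line =~ /\A\s+\S/ && section && key
      # Indented line: continuation of the previous option's value.
      section[key] += "\n" + line.strip
    elsif line =~ /\A([^=:\s][^=:]*?)\s*[=:]\s*(.*)\z/ && section
      # "name = value" or "name: value"; option names are lowercased.
      key = $1.strip.downcase
      section[key] = $2.strip
    else
      raise ArgumentError, "cannot parse config line: #{line.inspect}"
    end
  end

  config
end

sample = <<~INI
  [Planet]
  name = Planet Mars
  output_dir: output

  # a subscription
  [http://example.com/feed.atom]
  name = Example Feed
INI

config = parse_config(sample)
puts config["Planet"]["name"]
puts config.keys.inspect
```

The result is plain nested Hashes, so a subscription is looked up by its section name (here, the feed URL) and Planet-specific defaults could later be merged in with Hash#merge.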

Also provided

planet
A small main program to drive the execution.
reconstitute
A demo program which will enable you to see what the parsed, transmogrified, and sifted version of any given feed looks like.
test/feedparser
Runs a small portion (currently wellformed/(atom10|rss)/*.xml) of the feedparser test suite.  Check the comments to see what is not yet supported (mostly elements like cloud and textInput).

All in all, I’m pleased with how compact this code is.  If anybody wants to join in on the fun, it is available as a bzr repository and there are plenty of test cases ready to be ported.