intertwingly

It’s just data

Planet Hopping


Jacques Distler: I got quite annoyed that the existing software (Venus) was unable to handle my own Atom feed. Apparently, the Universal Feedparser is weak, and easily confused by posts like this one.

Inside the feedparser is the following comment, originally by Mark Pilgrim:

# This will horribly munge inline content with non-empty qnames,
# but nobody actually does that, so I'm not fixing it.

Two comments: this is now a bit of an overstatement, as I’ve addressed a number of the common use cases for SVG and MathML.  And it is amazing that we have gotten to 2008 without this being an issue.

That’s the good news.  The bad news is that further progress is difficult.  The feed parser’s internal model for content is a serialized string.  Such a string is repeatedly pulled apart using an SGML parser and put back together.  It was the best technology at the time.  Workable, but not ideal for HTML.  Problematic for XHTML.

That’s what inspired me to produce Mars.  Its internal model is a REXML DOM.  Atom feeds with xhtml or text content are directly read into that DOM (ideally using libxml2).  Content that is escaped html utilizes the html5lib parser to produce a DOM.  Further processing (such as sanitization and resolving relative URIs) is done directly on the DOM.
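To make the "process the DOM directly" idea concrete, here is a minimal sketch in Python using the standard library's ElementTree.  It is not Mars code (Mars is Ruby on REXML), and the names resolve_relative_uris, sanitize, URI_ATTRS and UNSAFE_TAGS are made up for illustration.  The sanitizer is deliberately simplistic, nowhere near a real whitelist.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML = "{http://www.w3.org/1999/xhtml}"
URI_ATTRS = ("href", "src")        # assumption: only these carry URIs
UNSAFE_TAGS = {XHTML + "script"}   # assumption: a toy blacklist


def resolve_relative_uris(root, base):
    """Rewrite relative href/src attributes against base, in place."""
    for element in root.iter():
        for attr in URI_ATTRS:
            if attr in element.attrib:
                element.set(attr, urljoin(base, element.get(attr)))


def sanitize(root):
    """Drop unsafe elements and on* event attributes, in place."""
    # Snapshot the tree first so removals don't disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in UNSAFE_TAGS:
                # Note: ElementTree drops the child's tail text too.
                parent.remove(child)
    for element in root.iter():
        for attr in [a for a in element.attrib if a.lower().startswith("on")]:
            del element.attrib[attr]


content = ET.fromstring(
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<a href="../about" onclick="x()">about</a>'
    '<script>alert(1)</script>'
    '</div>'
)
resolve_relative_uris(content, "http://example.org/blog/post/")
sanitize(content)
```

Nothing here is ever serialized and reparsed between steps, so namespaced inline content survives each pass untouched.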

Additional methods are added to the REXML elements to make traversing the DOM as convenient as the feedparser does.  In fact, it goes further and borrows an idea from JavaScript, making properties accessible via either hash index or named attribute notation.  For example, d['feed']['title'] can be more simply expressed as d.feed.title.  Of course, the full REXML methods (including XPath) are also available.
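The dual notation is easy to sketch.  What follows is a Python analogy over ElementTree, not the actual Mars code, which extends REXML elements in Ruby.  The Node class and parse function are hypothetical names, and the sketch only looks up Atom-namespaced children.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


class Node:
    """Hypothetical wrapper exposing child elements two ways."""

    def __init__(self, element):
        self._element = element

    def __getitem__(self, name):
        # Hash-index notation: d['feed']
        child = self._element.find(ATOM + name)
        if child is None:
            raise KeyError(name)
        return Node(child)

    def __getattr__(self, name):
        # Named-attribute notation: d.feed
        # Only invoked when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __str__(self):
        return self._element.text or ""


def parse(xml_text):
    """Wrap the root in a holder element so d['feed'] finds it."""
    holder = ET.Element("document")
    holder.append(ET.fromstring(xml_text))
    return Node(holder)


d = parse(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<title>Example</title>'
    '</feed>'
)
print(d['feed']['title'])
print(d.feed.title)
```

Both print calls go through the same lookup, and the underlying element stays reachable for anything the shortcut doesn't cover.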

The downside for Mars at the moment is that its development has focused on relatively good feeds.  It does have code in place to attempt to parse non-well-formed feeds using bits and pieces that are part of the HTML5 parser, but that’s only lightly tested at this point.  And while it does support a number of the more popular RSS formats out there, it doesn’t attempt to handle Atom 0.3 or some of the more obscure RSS formats.

It also looks like I have a few more patches I can pull from.  This one, in particular, looks interesting.  Apparently, I hadn’t documented harvest adequately, as it should be able to directly address the ERb issue mentioned.