Dealing with HTML in Feeds
Frédéric Wang: Issue with self closing MathML tags in planet
The problem here is that Frédéric takes the same content that he carefully serves as application/xhtml+xml and places it in his Atom feed as HTML. Planet Venus is based on feedparser, which uses sgmllib to parse HTML content, and sgmllib by design ignores the self-closing tag syntax.
There are a few changes Frédéric should consider in order to make his feed consumable by the widest variety of consumers, but this post focuses on what changes should be made to the feed parser in order to better support this case.
The simplest, and lowest-risk, approach is to automatically close `mspace`, `mglyph`, `msline`, `none`, `mprescripts`, `malignmark`, and `maligngroup` when inside a `math` element. This process will need to be repeated for SVG, which undoubtedly has a considerably larger number of such elements.
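To make the auto-closing idea concrete, here is a minimal sketch of the technique using the stdlib `html.parser` (feedparser itself uses sgmllib; the class and function names below are invented for illustration, and only the element list comes from the post):

```python
from html.parser import HTMLParser

# Elements to auto-close when they appear inside <math> (list from the post).
MATHML_AUTOCLOSE = {'mspace', 'mglyph', 'msline', 'none',
                    'mprescripts', 'malignmark', 'maligngroup'}

class MathMLAutoCloser(HTMLParser):
    """Re-emit a fragment, immediately closing the elements above inside <math>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.math_depth = 0

    def _attrs(self, attrs):
        return ''.join(' %s="%s"' % (k, v or '') for k, v in attrs)

    def handle_starttag(self, tag, attrs):
        if tag == 'math':
            self.math_depth += 1
        if self.math_depth and tag in MATHML_AUTOCLOSE:
            # sgmllib drops the "/" in "<mspace/>", so close immediately
            self.out.append('<%s%s></%s>' % (tag, self._attrs(attrs), tag))
        else:
            self.out.append('<%s%s>' % (tag, self._attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        # make "<mspace/>" and "<mspace>" come out the same way
        self.handle_starttag(tag, attrs)
        if not (self.math_depth and tag in MATHML_AUTOCLOSE):
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        if tag == 'math' and self.math_depth:
            self.math_depth -= 1
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

def autoclose(fragment):
    parser = MathMLAutoCloser()
    parser.feed(fragment)
    parser.close()
    return ''.join(parser.out)
```

With this pass, both `<math><mspace/>x</math>` and `<math><mspace>x</math>` come out as `<math><mspace></mspace>x</math>`, while a `none` element outside `math` is left alone.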
A more comprehensive approach, and therefore one which simultaneously provides greater benefit and greater risk, is replacing the calls to sgmllib with calls to html5lib. There are two parts to this effort: (a) separate out the places where sgmllib is used to parse ill-formed feeds from the places where it is known to be parsing HTML, and (b) if html5lib is available, use dom2sax to produce events that can be mapped to their sgmllib equivalents.
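The event-mapping half of (b) can be sketched as a thin adapter. Note the token shape below is simplified for illustration (html5lib's real tree-walker tokens key attributes by `(namespace, name)` tuples); the handler method names follow sgmllib's conventions, which feedparser already implements:

```python
class SgmllibStyleHandler:
    """Collects events the way a sgmllib.SGMLParser subclass would receive them."""

    def __init__(self):
        self.events = []

    def unknown_starttag(self, tag, attrs):
        self.events.append(('start', tag, attrs))

    def unknown_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

def replay_tokens(tokens, handler):
    """Map simplified tree-walker tokens onto sgmllib-equivalent handler calls."""
    for token in tokens:
        kind = token['type']
        if kind in ('StartTag', 'EmptyTag'):
            # sgmllib hands attributes over as a list of (name, value) pairs
            attrs = sorted(token.get('data', {}).items())
            handler.unknown_starttag(token['name'], attrs)
            if kind == 'EmptyTag':
                handler.unknown_endtag(token['name'])
        elif kind == 'EndTag':
            handler.unknown_endtag(token['name'])
        elif kind in ('Characters', 'SpaceCharacters'):
            handler.handle_data(token['data'])
```

The point of an adapter like this is that the downstream code (sanitization, relative URI resolution, microformat extraction) does not need to know which parser produced the events.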
Implementing both results in a number of test failures, which I have sorted by severity and describe below:
- `UnicodeDecodeError` simply means that a character with a high-order byte made it into a section of code which previously had never encountered one. These are not generally difficult to deal with, and we have test cases. If this approach is adopted and deployed, others will undoubtedly be encountered in the wild, captured as test cases, and fixed.
- Stray elements being stripped is simply a test case bug. It is bogus and meaningless to have a `<tfoot>` element outside of a `<table>`, so the correct fix is to change the test cases to include `<table>` elements.
- `</br>` is treated by HTML5 as `<br/>`. Perhaps it shouldn’t be, but that’s what the spec currently says.
- Ill-formed line noise is converted into different well-formed data than it was previously. Not a concern.
- `iframe` content is handled differently by html5lib: the net effect is that it becomes escaped. Not a concern in the context of feeds.
- `xmlns:xlink` declarations are moved, and `circle` elements are closed; either of no concern or actually an improvement.
- A number of “crazy” tests are sanitized differently, and (in my opinion) more comprehensively and correctly.
- CDATA sections and DOCTYPEs in HTML content are treated as bogus comments and stripped. Good.
- Because we are doing a full parse and then inferring what the source might have looked like, character entity references are either mapped into the correct utf-8 encoding or into a canonical form. As an example, `&amp;`, `&#38;`, and `&#x26;` are all mapped to `&amp;`. This too is fine.
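A rough stdlib analogue of that normalization is easy to demonstrate: parse the entity to a character, then re-serialize in one canonical form (the function name here is invented; html5lib does this as a side effect of its parse/serialize cycle rather than via these calls):

```python
from html import escape, unescape

def canonicalize(ref):
    # decode any entity form to the character, then re-escape canonically
    return escape(unescape(ref), quote=False)
```

All three spellings of the ampersand entity collapse to the same canonical output.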
That’s pretty much it. Not too bad, really, until you realize that there is an effort underway to port the feedparser to Python 3, and efforts to port html5lib to Python 3 appear to have stalled.
Finally, the process of repeatedly parsing HTML content into a DOM, producing events from the DOM, looking for simple patterns like href attributes which may need to be resolved, producing a string, and then repeating the whole process again to do sanitization or microformats or whatever, is a bit suboptimal. A better approach would be to convert all HTML once into a DOM and then traverse and scour that DOM as many times as necessary. That’s the design of Mars, a more ambitious refactoring of Planet.
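The parse-once, traverse-many shape might look like the following sketch. The pass functions are invented for illustration (Mars and Planet have their own internals), and `xml.dom.minidom` stands in for whatever DOM the real pipeline would use:

```python
from urllib.parse import urljoin
from xml.dom.minidom import parseString

def resolve_hrefs(node, base):
    # one pass: absolutize href attributes against the entry's base URI
    if node.nodeType == node.ELEMENT_NODE:
        if node.hasAttribute('href'):
            node.setAttribute('href', urljoin(base, node.getAttribute('href')))
        for child in node.childNodes:
            resolve_hrefs(child, base)

def strip_scripts(node):
    # another pass over the same DOM: drop <script> elements entirely
    if node.nodeType == node.ELEMENT_NODE:
        for child in list(node.childNodes):
            if getattr(child, 'tagName', None) == 'script':
                node.removeChild(child)
            else:
                strip_scripts(child)

# Parse once, then run every pass against the same tree -- no intermediate
# serialize/re-parse round trips between passes.
doc = parseString('<div><a href="/post">x</a><script>evil()</script></div>')
for run_pass in (lambda n: resolve_hrefs(n, 'http://example.com/'),
                 strip_scripts):
    run_pass(doc.documentElement)
```

Each additional concern (sanitization, microformats, whatever) becomes just another pass over the tree, instead of another full parse.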