intertwingly

It’s just data

Xhtml5lib


While there unquestionably are a lot of applications of XML for which strict, draconian error handling is appropriate, there also are a number of use cases for which robust scavenging is required, as is evidenced by the popularity of libraries such as Beautiful Soup and the Universal Feed Parser.  I’ve even done likewise for OPML.

HTML5’s grammar is a rich blend of SGML (the common ancestor to both HTML and XML), XML, and custom parsing rules.  These rules were arrived at by observing the effective consensus that browser vendors have converged on in the process of dealing with the enormous diversity of documents that exist on the Internet: documents often produced either by hand editing or by copy/pasting portions of templates.

Much of that experience can directly benefit those who find themselves in need of recovering data from malformed XML at any cost, particularly for XML documents which are produced using the same hand editing, copy/pasting, and templating techniques that are used to produce invalid HTML.  Additionally, given the rough similarity between HTML and XML syntax, naïve users will often copy things that happen to work in HTML into XML documents.
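To make that last point concrete, here is a small sketch of what a strict parser does with a few HTML habits pasted into XML.  The sample strings are my own illustrations, not taken from any real feed, and the standard library's ElementTree stands in for “a strict XML parser”.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: constructs that browsers accept in HTML
# but that are not well-formed XML.
SAMPLES = {
    "named entity": "<p>a&nbsp;b</p>",
    "unquoted attribute": "<p class=note>hi</p>",
    "unclosed void element": "<p>one<br>two</p>",
}


def strict_error(text):
    """Return the strict parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


for label, text in SAMPLES.items():
    print(label, "->", strict_error(text))
```

Each of these samples is rejected outright by the strict parser, which is exactly the situation in which a scavenger becomes useful.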

For these reasons, it should be no surprise that only some relatively small adaptations to the existing html5lib tokenizer and html5parser are needed to support an XML/XHTML scavenger library.  With tests.

Just be aware that in scavenge mode, some data will be interpreted in a manner different than the author intended, as such intent can’t be determined.  Also be aware that some of the more advanced XML features that are less commonly used in hand-produced XML, like internal DTD subsets, are not supported by this process.  For this reason, it is recommended that data first be parsed by a “real” XML parser and this logic only be used as a fallback.
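The recommended order (strict parser first, scavenger only as a fallback) might be sketched as follows.  This is a minimal illustration, not the library's API: `parse_with_fallback` is a name I made up, ElementTree stands in for the “real” XML parser, and `scavenge` is whatever callable you supply to wrap the scavenger.

```python
import xml.etree.ElementTree as ET


def parse_with_fallback(text, scavenge):
    """Try a strict parse first; scavenge only if that fails.

    `scavenge` is a caller-supplied callable (an assumption of this
    sketch) that takes the raw text and returns a best-effort result.
    Returns a (result, scavenged) pair so callers can tell which
    path produced the data.
    """
    try:
        return ET.fromstring(text), False
    except ET.ParseError:
        # The document is not well-formed; recover what we can.
        return scavenge(text), True
```

Returning the flag alongside the result lets the caller treat scavenged data with more suspicion, since the author's intent may not have survived.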

References: