At some point, I’ll likely backport this version to Python. I have an unfinished branch of Venus where all sgmllib processing is replaced with html5lib, and sanitization is done after that.
I would very much like to see a more robust (html5lib) error-correcting parser in the Ruby version.
This is not a pressing issue for me, since, in my case, the parser is consuming well-formed XHTML previously serialized by REXML. But, for more general use, it needs to be able to sanitize arbitrary tag-soup input.
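To make the distinction concrete, here is a minimal Python sketch using only the standard library. It is not the sanitizer under discussion and not html5lib: `xml.etree.ElementTree` stands in for a strict XML parser like REXML, and `html.parser` stands in for a lenient one. Note that `html.parser` only tokenizes leniently; it does not build a corrected tree the way html5lib does. The sample strings and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

WELL_FORMED = "<p>Hello <b>world</b></p>"
TAG_SOUP = "<p>Hello <b>world</p>"  # the <b> element is never closed


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient tokenizer: records start tags, does not require well-formedness."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def lenient_start_tags(markup):
    """Return the start tags seen in the markup, even if it is tag soup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags
```

The strict parser rejects the tag-soup string outright, so a sanitizer built on it never gets a chance to run on such input. The lenient tokenizer still sees both tags, which is the minimum a sanitizer needs before it can decide what to keep.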
P.S.: Just as a stylistic matter, you might want to:
1. Remove the “1” from your quote. In the original, it was a superscript link to a footnote.
2. Link to the blog entry, rather than to the main page of the blog. Entries eventually fade from the main page (though, in my case, perhaps they don’t fade fast enough).
Since you’re making links, you might want to turn unit tests into a link, too. (Also, people might search in vain for XHTML::Node if we don’t tell them where to find it.)
A while back, I commented that I would likely backport Jacques’s sanitizer to Python. I still haven’t gotten around to that, but I have ported it to html5lib (source, tests). My approach was slightly different.