intertwingly

It’s just data

Comparing Purifiers


Mark Pilgrim: <http://htmlpurifier.org/> is the one to beat.  It passes these tests: <http://htmlpurifier.org/live/smoketests/xssAttacks.php>

I’m not convinced.  Perhaps by comparing this to html5lib’s sanitizer (which has its own set of tests, BTW), both can be improved.

Both handle valid input.  Both gracefully recover from bad input.  Both can produce well-formed output.

HTMLPurifier’s error handling is based on what the HTML4 spec allows.  html5lib’s error handling is based on what browsers actually do.  As a concrete example of the difference, consider the following:

<table><tr><td>x</td></tr> here's some text

Now, text is not allowed there per the spec, so it gets purified away.
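To make the difference concrete, here is a toy sketch of the two recovery policies applied to that snippet.  It uses neither HTMLPurifier nor html5lib: it is built on Python’s standard `html.parser`, and the names `MisplacedTextDemo`, `repair`, and the `"drop"` / `"foster"` policy labels are made up for illustration.  The `"foster"` branch only approximates what browsers do (move the stray text in front of the table); a real HTML5 parser does more, such as inserting an implied `tbody`.

```python
from html.parser import HTMLParser

# Elements that may not directly contain non-whitespace text.
TABLE_CONTEXT = {"table", "thead", "tbody", "tfoot", "tr"}

SNIPPET = "<table><tr><td>x</td></tr> here's some text"


class MisplacedTextDemo(HTMLParser):
    """Toy repairer: 'drop' discards stray table text, 'foster' moves it
    in front of the enclosing table.  Hypothetical, for illustration only."""

    def __init__(self, policy):
        super().__init__(convert_charrefs=True)
        self.policy = policy   # "drop" or "foster"
        self.stack = []        # open elements as (tag, index of start tag in out)
        self.out = []          # serialized output pieces

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, len(self.out)))
        self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self.stack]:
            return             # ignore end tags that match nothing
        while self.stack:
            name, _ = self.stack.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        misplaced = bool(self.stack
                         and self.stack[-1][0] in TABLE_CONTEXT
                         and data.strip())
        if not misplaced:
            self.out.append(data)
            return
        if self.policy == "foster":
            # Find the nearest open table and put the text just before it.
            for name, index in reversed(self.stack):
                if name == "table":
                    self.out.insert(index, data)
                    self.stack = [(n, i + 1 if i >= index else i)
                                  for n, i in self.stack]
                    break
        # policy == "drop": the text is simply discarded

    def close(self):
        super().close()
        while self.stack:      # close whatever the author forgot to close
            self.out.append("</%s>" % self.stack.pop()[0])


def repair(markup, policy):
    parser = MisplacedTextDemo(policy)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(repair(SNIPPET, "drop"))
print(repair(SNIPPET, "foster"))
```

Under the `"drop"` policy the trailing text vanishes along with the author’s mistake; under `"foster"` it survives, just in a different place.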

If you are asking yourself, “but why would anybody put text there?”, you are simply asking the wrong question.  The right question to be asking is: “but who would forget to close a table?”.  The latter happens all the time.  In fact, here is an example.  That page is exactly the one you would need to screen scrape in order to determine what HTMLPurifier’s whitelists actually are.

Luckily, html5lib can handle such data with ease.  This program produces this output.

Update: This new version reflects the fact that HTMLSanitizer whitelists background, border, margin, padding, and -* variants thereof.  (updated output).

Now for the interesting question: how should this page be updated?  Any takers?