It’s just data

In defence of Polyglot

I see that Henri Sivonen is once again being snarky without backing his position.  I’ll state my position, namely that something like the polyglot specification needs to exist, and why I believe that to be the case.

The short version is that I have developed a library that I believe to be polyglot compatible.  By that I mean that if there are differences between what this library does and what polyglot specifies, one or both should be corrected to bring them into compliance.

I didn’t write this library simply because I am loony, but very much to solve a real problem.

The problem is that HTML source files exist that contain artifacts like consecutive <td> elements; people process such documents using tools such as anolis; and such tools often depend, for good reasons, on libraries such as libxml2, which do an imperfect job of parsing HTML.  The output produced by such tools when combined with such libraries is incorrect.
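To make the failure mode concrete, here is a minimal sketch.  It assumes that "consecutive <td> elements" means table cells whose end tags are omitted, which HTML allows.  It uses neither anolis nor libxml2; the hypothetical NaiveDepth class below stands in for any tool that builds a tree from tags without knowing HTML's implied end tag rules.

```python
from html.parser import HTMLParser


class NaiveDepth(HTMLParser):
    """Hypothetical stand-in for an imperfect tool: it tracks nesting
    purely from the tags it sees and knows nothing about implied end tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.td_depths = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        if tag == "td":
            # Record how deeply this cell appears to be nested.
            self.td_depths.append(self.depth)

    def handle_endtag(self, tag):
        self.depth -= 1


def td_depths(markup):
    """Return the apparent nesting depth of every <td> in the markup."""
    parser = NaiveDepth()
    parser.feed(markup)
    parser.close()
    return parser.td_depths


# Valid HTML, but the first cell's end tag is only implied.
implicit = "<table><tr><td>one<td>two</tr></table>"
# The same table with every element explicitly closed.
explicit = "<table><tr><td>one</td><td>two</td></tr></table>"

print(td_depths(implicit))  # [3, 4]: the second cell looks nested in the first
print(td_depths(explicit))  # [3, 3]: the cells are siblings, as intended
```

A conforming HTML5 parser would treat both inputs identically.  The naive tool only gets the right answer for the second one.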

Note that I stop well short of recommending that others serve their content as application/xhtml+xml.  Or that tools should halt and catch fire if they are presented with incorrect input.  In fact, I would even be willing to say that in general people SHOULD NOT do either of these things.

Now that I have provided instance proofs of the problem and the solution, I’ll proceed with the longer answer.  I will start by noting that Postel’s law has two halves, and while the HTML WG has focused heavily on the second half of that law, the story should not stop there.

To get HTML right involves a number of details that people often get wrong.  Details such as encoding and escaping.  Details that have consequences such as XSS vulnerabilities when the scenario involves integrating content from untrusted sources.  Scenarios which include comments on blogs or feed aggregators.  Scenarios that lead people to write sanitizers and to employ imperfect HTML parsers.
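As an illustration of the escaping detail, here is a minimal sketch using Python's standard html.escape.  The render_comment function and its markup are hypothetical, and escaping everything is only the simplest case: a blog that wants to allow some markup in comments needs a real sanitizer, which is exactly where the parser problem comes back in.

```python
from html import escape


def render_comment(author, body):
    """Hypothetical helper: embed untrusted text in a page as plain text.

    escape() replaces &, <, > and both quote characters with character
    references, so the input cannot introduce tags or break out of an
    attribute value.
    """
    return '<p class="comment"><b>{}</b>: {}</p>'.format(
        escape(author), escape(body)
    )


hostile = "<script>alert(1)</script>"
print(render_comment("Mallory", hostile))
```

The script element arrives in the page as visible text rather than as markup.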

It is well and good that Henri maintains, on a best-effort basis only, a superior parser for exactly one programming language.  Advertising this library more won’t solve the problem for people who code in languages such as C#, Perl, PHP, Python, or Ruby.  Fundamentally, a “tools will save us” response is not an adequate response when the problem is imperfect tools.

The problem that needs to be addressed is very much the flip side of, and complement to, the parsing problem that HTML5 has competently solved.  Given a handful of browser vendors and an uncountable number of imperfect documents, it very much makes sense for the browser vendors to get together and agree on how to handle error recovery.  By the very same token, it makes sense for authors who may produce a handful of pages to be processed by an uncountable number of imperfect tools to agree on restrictions that may go well beyond the minimal logical consequences from normative text elsewhere if those restrictions increase the odds of the document produced being correctly processed.

Yes, it would be great if this weren’t necessary and all tools were perfect.  Similarly, it would be great if browser vendors didn’t have to agree on error recovery as this makes the creation of streaming parsers more difficult.  The point is that while both would be great, neither will happen, at least not any time soon.

These restrictions may indeed go beyond “always explicitly close all elements” and “always quote all attribute values”.  They may include such statements as “always use UTF-8”.
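Here is a minimal sketch of what producing markup under those three restrictions might look like.  The element and text helpers are hypothetical: they are not the polyglot specification and not the library mentioned above, and they cover only a small part of what a real serializer must handle.

```python
import xml.etree.ElementTree as ET
from html import escape

# Elements HTML defines as void; written as <br/> so that HTML and XML
# parsers agree they have no content.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def text(value):
    """Escape character data for use as element content."""
    return escape(value, quote=False)


def element(tag, attrs=None, *children):
    """Serialize one element: every attribute value is quoted and every
    non-void element gets an explicit end tag."""
    attr_text = "".join(
        ' {}="{}"'.format(name, escape(value, quote=True))
        for name, value in (attrs or {}).items()
    )
    if tag in VOID:
        return "<{}{}/>".format(tag, attr_text)
    return "<{}{}>{}</{}>".format(tag, attr_text, "".join(children), tag)


row = element(
    "tr", None,
    element("td", {"class": "n"}, text("one")),
    element("td", {"class": "n"}, text("two & three")),
)

# "Always use UTF-8": encode explicitly rather than relying on a default.
payload = row.encode("utf-8")

# Because nothing is left implied, even a strict XML parser recovers the cells.
cells = [td.text for td in ET.fromstring(payload).iter("td")]
print(row)
print(cells)
```

Output written this way costs the author a few extra characters, and in exchange the strict XML parser used in the last step reads it without any error recovery.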

Such restrictions are not a bad thing.  In fact, such restrictions are very much a good thing.