It’s just data

In defence of Polyglot

I see that Henri Sivonen is once again being snarky without backing up his position.  I’ll state my position, namely that something like the polyglot specification needs to exist, and explain why I believe that to be the case.

The short version is that I have developed a library that I believe to be polyglot compatible, and by that I mean that if there are differences between what this library does and what polyglot specifies, then one or both should be corrected to bring them into agreement.

I didn’t write this library simply because I am loony, but very much to solve a real problem.

The problem is that HTML source files exist that contain artifacts like consecutive unclosed <td> elements; people process such documents using tools such as anolis; and such tools often depend — for good reasons — on libraries such as libxml2, which do an imperfect job of parsing HTML.  The output produced by such tools when combined with such libraries is incorrect.
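To make that concrete, here is a minimal sketch using Nokogiri (a Ruby wrapper around libxml2); the markup is illustrative only, and the exact tree you get will depend on your libxml2 version:

    require 'nokogiri'

    # Valid HTML5 that relies on implied end tags for consecutive <td> elements
    sloppy = '<table><tr><td>one<td>two</table>'

    # The same table with every element explicitly closed, polyglot style
    strict = '<table><tr><td>one</td><td>two</td></tr></table>'

    [sloppy, strict].each do |source|
      doc = Nokogiri::HTML(source)   # parsed by libxml2's HTML parser
      puts doc.css('td').map(&:text).inspect
    end

A parser that implements HTML5 tree construction prints the same two-cell result for both inputs; an imperfect one may not.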

Note that I stop well short of recommending that others serve their content as application/xhtml+xml.  Or that tools should halt and catch fire if they are presented with incorrect input.  In fact, I would even be willing to say that in general people SHOULD NOT do either of these things.

Now that I have provided instance proofs of both the problem and a solution, I’ll proceed with the longer answer.  I will start by noting that Postel’s law has two halves (“be conservative in what you do, be liberal in what you accept”), and while the HTML WG has focused heavily on the second half of that law, the story should not stop there.

To get HTML right involves a number of details that people often get wrong.  Details such as encoding and escaping.  Details that have consequences, such as XSS vulnerabilities, when the scenario involves integrating content from untrusted sources.  Scenarios which include comments on blogs or feed aggregators.  Scenarios that lead people to write sanitizers and employ imperfect HTML parsers.
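As a sketch of the escaping half of that (illustrative only, not a substitute for a real sanitizer), Ruby’s standard library can turn markup-significant characters in untrusted text into character references:

    require 'cgi'

    untrusted = %q{<script>alert("xss")</script>}

    # Escaped, the hostile comment renders as visible text
    # rather than executing as script.
    puts CGI.escapeHTML(untrusted)
    # => &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;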

It is well and good that Henri maintains — on a best effort basis only — a superior parser for exactly one programming language.  Advertising this library more won’t solve the problem for people who code in languages such as C#, Perl, PHP, Python, or Ruby.  Fundamentally, a “tools will save us” response is not an adequate response when the problem is imperfect tools.

The problem that needs to be addressed is very much the flip side of, and complement to, the parsing problem that HTML5 has competently solved.  Given a handful of browser vendors and an uncountable number of imperfect documents, it very much makes sense for the browser vendors to get together and agree on how to handle error recovery.  By the very same token, it makes sense for authors, who may produce a handful of pages to be processed by an uncountable number of imperfect tools, to agree on restrictions that may go well beyond the minimal logical consequences of normative text elsewhere, if those restrictions increase the odds of the documents produced being correctly processed.

Yes, it would be great if this weren’t necessary and all tools were perfect.  Similarly, it would be great if browser vendors didn’t have to agree on error recovery, as doing so makes the creation of streaming parsers more difficult.  The point is that while both would be great, neither will happen, at least not any time soon.

These restrictions may indeed go beyond “always explicitly close all elements” and “always quote all attribute values”.  They may include statements such as “always use UTF-8”.

Such restrictions are not a bad thing.  In fact, such restrictions are very much a good thing.


Did you read the article? It says “By doing this, your documents will almost assuredly be better structured and of higher quality, yet still be able to be treated as HTML5.” That’s like the old XHTML advocacy all over again. Being polyglot has nothing to do with better structure or quality—it is only about being HTML and XHTML at the same time.

Posted by Henri Sivonen at

Henri: have you written a bug report on this?

Better structured is indeed debatable.  I would, however, argue that being careful about matters such as escaping does result in higher quality output.

Posted by Sam Ruby at

A bug report about what? About the SitePoint article?

Posted by Henri Sivonen at

Ah, my bad.  As to the article, I can see how an author of an HTML validator would care about — and indeed advocate for — people writing documents that can be processed by more tools.  Even tools that are, as I stated, not conforming to the latest standards.

I might have said “more explicitly express their structure” rather than “better structured”.  But I do agree about “higher quality”.

And therefore, I disagree with “Being polyglot has nothing to do with better structure or quality”.  It is helpful and pragmatic advice to people who face the flip side of the very problem that you have focused on for the past several years.

Posted by Sam Ruby at

“It is helpful and pragmatic advice to people who face the flip side of the very problem that you have focused on for the past several years.”

If the flip side is writing input that works with non-compliant HTML parsers (as opposed to writing input that works with XML parsers), focusing on Polyglot is missing the point completely. If the problem is “I want to write HTML that the non-compliant HTML parser in libxml2 can parse”, it would make more sense to document a profile that works in a particular set of widely-used non-compliant HTML parsers than to document what works in XML parsers and hope that the same thing helps with non-compliant HTML parsers, too.

Posted by Henri Sivonen at

Henri, I encourage you to read what standards libxml2 purports to support.  Specifically:

HTML4 parser: [link]

As a person who often codes in Ruby, I make heavy use of Nokogiri, which is based on libxml2.

Posted by Sam Ruby at

Henri, I encourage you to read what standards libxml2 purports to support.

That the parser in question is not compliant with HTML5 because it does not even try to be is immaterial to my point that Polyglot is the wrong solution for working with HTML5-non-compliant HTML parsers.

Posted by Henri Sivonen at

Forgive me, Henri, but I see that statement as being every bit as false as the following strawman:

That the page in question is non-compliant HTML5 because it does not even try to be is immaterial to the point that the HTML5 specification is the wrong solution for working with HTML5-non-compliant HTML pages.

Non-compliant parsers, as well as non-compliant pages, are a reality.  They outnumber you.  They are both beyond your personal power to correct.  That is reality.

Posted by Sam Ruby at

Non-compliant parsers, as well as non-compliant pages, are a reality.  They outnumber you.  They are both beyond your personal power to correct.  That is reality.

Correct, but presenting Polyglot as a solution is non sequitur.

Posted by Henri Sivonen at

I disagree.  In fact, I have indisputable evidence to the contrary.  My pages (such as this one) are polyglot.  They undeniably work better with non-conformant HTML parsers than the source of the HTML5 specification itself does.  And they do so because they don’t make the assumption that every parser is aware of every special-case parsing rule that exists in the HTML5 specification.

Henri: you certainly can make the case that the Polyglot specification can be improved (bug reports welcome!).  Or you can make the case that it isn’t the only solution to this problem (proposals welcome!).  In fact, if you can point to another solution, you can even make the case that Polyglot isn’t the best solution available.

The one case you can’t make is that the restrictions that are present in the HTML5 specification alone as it currently exists are sufficient.

Posted by Sam Ruby at

There is very likely overlap between the set of restrictions needed to make libxml2’s HTML parser behave and the set of restrictions that make a document Polyglot. But promoting the second set of restrictions instead of the first one is likely to lead to the same kind of detachment from truth as XHTML advocacy of the previous decade.

If you have a solid use case for one set of restrictions, I’d much rather see you promote that set of restrictions instead of promoting another overlapping set of restrictions that has the sort of labeling that will fascinate the uninformed in the same ways that Appendix C did.

I’m not writing down the set of restrictions that I’d prefer you to promote instead, because I don’t have use cases for that set of restrictions.

Posted by Henri Sivonen at

Detachment from truth?  GMAFB

Henri: as previously stated, unclosed consecutive <td> elements cause problems not only with anolis when configured to use libxml2, but also with the Ruby Nokogiri gem.

This is truth, and you can deny it, but doing so will have about the same effect as denying global warming.

Over time, I have developed a successful set of coping mechanisms to deal with this.  For example, I not only always use UTF-8, but I also always declare it BOTH in a meta tag AND in the Content-Type header.
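To illustrate (a hypothetical sketch; the app and markup here are made up, not code from any of my libraries), a Rack-style handler that declares the encoding in both places:

    app = lambda do |env|
      body = <<~HTML
        <!DOCTYPE html>
        <html xmlns="http://www.w3.org/1999/xhtml">
          <head>
            <meta charset="utf-8"/>
            <title>example</title>
          </head>
          <body><p>encoding declared twice</p></body>
        </html>
      HTML
      # First declaration: the HTTP Content-Type header.
      # Second declaration: the meta tag inside the document itself.
      [200, { 'Content-Type' => 'text/html; charset=utf-8' }, [body]]
    end

    status, headers, content = app.call({})
    puts headers['Content-Type']   # => text/html; charset=utf-8

Belt and suspenders: if one declaration is lost (say, by a proxy, or by a local copy of the file that has no HTTP headers at all), the other still stands.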

Should the core HTML5 specification require such?  Absolutely not.  Should an optional profile be defined which extends the specification to provide entirely voluntary additional constraints that have been proven to make your content more likely to be understood by a variety of consumers?  Absolutely.

Posted by Sam Ruby at

Bookmark: bug 19923

Posted by Sam Ruby at

As the inclusion of “XHTML” in the title of the polyglot specification inspires some to produce counter-propaganda and relive 20th century battles, I’ve documented an alternate suggestion in the form of bug 19925.

Posted by Sam Ruby at

“The short version is that I have developed a library that I believe to be polyglot compatible, and by that I mean that if there are differences between what this library does and what polyglot specifies, then one or both should be corrected to bring them into agreement.”

Sam, aren’t there a bunch of tests, or some external constraints that can be pointed at, that let people determine what compatible means? Otherwise it seems subject to whoever can furnish the winning rhetoric.

Posted by Bill de hÓra at

“It is well and good that Henri maintains — on a best effort basis only — a superior parser for exactly one programming language.  Advertising this library more won’t solve the problem for people who code in languages such as C#, Perl, PHP, Python, or Ruby.  Fundamentally, a “tools will save us” response is not an adequate response when the problem is imperfect tools.”

See also: https://github.com/rubys/feedvalidator - I want to believe you know where this ends up :)

Posted by Bill de hÓra at

Excellent article; your loving and clarity-oriented language is a joy to read.

ITYM s/compliment/complement/

Posted by Johan Sundström at

aren’t there a bunch of tests

Never enough.  First installments: wunderbar, builder.  Admittedly, both are the “wrong” way round, namely they test serialization rather than deserialization.
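Serialization is, admittedly, the easy direction: generators like builder close every element and quote every attribute by construction.  A minimal sketch (assuming the builder gem is installed; the table is illustrative only):

    require 'builder'

    x = Builder::XmlMarkup.new(indent: 2)
    x.table do
      x.tr do
        x.td 'one'
        x.td 'two'
      end
    end

    # Every element comes out explicitly closed, and any attributes
    # would come out quoted, which is much of what polyglot asks for.
    puts x.target!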

See also: https://github.com/rubys/feedvalidator

The best base to build a polyglot validator upon would be Henri’s excellent validator.nu.

ITYM s/compliment/complement/

Fixed.  Thanks!

Posted by Sam Ruby at

Not only do I agree with Sam Ruby that polyglot documents are easier to process for non-compliant parsers (and thus are a good idea on that basis alone), but I even think that the “detachment from truth as XHTML advocacy of the previous decade” has made HTML better. Without it, we surely wouldn’t have lower-case being the preferred way to write HTML, we wouldn’t have attribute quotes being preferred (however optional they may be), and we wouldn’t be closing tags unless not doing so made stuff look weird in our preferred browser.

XHTML has made HTML better. It’s not a scientifically provable fact, but anyone involved in the field of web development since its inception has to admit that it’s the truth. It’s at least hard to disprove the value XHTML advocacy has had in enforcing “be conservative in what you do” on HTML. Without it, HTML would basically be on the same qualitative level it was 15 years ago. While I agree it was a detour and that HTML5 is better in every way, XHTML has worked as a Sergeant Hartman on all web developers and made the web, and HTML, better.

Posted by Asbjørn Ulsberg at

“Being polyglot has nothing to do with better structure or quality—only about being HTML and XHTML at the same time.”
Yeah, the biggest reason polyglot is still needed for that is that IE8 is still in use. Another decade-old gap: IE8 lacks most of DOM Level 2, which is probably why jQuery is going to drop support for it in 2.0. I think the fact that IE8 is currently the biggest boat anchor on the web is well known. Google Apps recently dropped support for IE8.

Posted by Yuhong Bao at

Sam, regarding declaring UTF-8 both via HTTP and the meta element: if that is an idea you have for Polyglot Markup, then perhaps it should be captured and justified in a bug report? I believe it has not yet been captured.

Polyglot Markup offers 3 encoding declaration methods for HTML (BOM, meta@charset, HTTP) and 3 methods for XML (BOM, default, HTTP).

There appear to be some quite old parsers that don’t understand HTML5’s new meta charset attribute, and such parsers may have trouble with the BOM as well. (I am thinking of text browsers: Lynx and the like.) Is that your motivation? Or is the motivation perhaps to make sure that the encoding is still caught via HTTP in case the author forgets to declare it in the file?

Posted by Leif Halvard Silli at
