It’s just data

Polyglot Validation

The HTML5 Super Friends: The spec should clarify that an author can use XHTML or HTML syntax, that it is a coding style preference. It would be great if Henri could add a toggle to the validator that will check for syntax. Something along the lines of “Also check for XHTML syntax validity.”

At the moment Henri is a bit Firefox-focused, but I’ll volunteer to help to the extent that I can, with one key proviso, namely that the “super friends” produce concrete test cases for what they want flagged.  The test cases I have in mind should be very simple to write: a 10 to 30 line HTML document which demonstrates the condition to be flagged, and the text of an even-toned message to be produced.

The reason I’m insisting on test cases is that while this request sounds deceptively simple, the fact of the matter is that there are a lot of unanswered questions.  At the present time I have neither the time nor the inclination to work out all of the details for the corner cases.  I’ll even go further: I don’t even care if any particular test meets the criteria of “XHTML syntax validity”, I just care that the tests are properly vetted.

I’ll initially work with Mike Smith to deploy the code on the W3C qa-dev site which you can experiment with.  Note the (numerous) messages for unquoted attributes.  Once this is good enough for wider consumption, I’ll get Henri to deploy the code to Validator.nu.  Eventually this code will make it to the primary W3C site.  People will also be able to download the source and deploy it within their own firewalls.

It is my hope that we can quickly iterate.  Even if I only implement one major feature a week, getting to the 80% use case should be a matter of a month or two at most.  A single well-placed check should detect most unclosed tag issues.  If one (or even several) situations go undetected, providing additional test cases would ensure that they are caught in subsequent iterations.  And over time, this will only improve.


It’s nice to see Jeffrey Zeldman looking out for my interests.

I like Henri’s suggestion of building a SAX tree using both HTML and XHTML parsing rules, and comparing. Presumably, that would satisfy Zeldman’s requirements, but it wouldn’t catch things like <script><!-- ... inline javascript ... --></script>, which are actually important for polyglot publishers.

Posted by Jacques Distler at

Oh wait.  Never mind ....

Posted by Jacques Distler at

The specific example will be parsed differently.  What is a comment in XML will be text in HTML.

There exist techniques which, while parsed differently, will parse differently in ways that don’t matter.  And for small scripts which might only use > but never <, no escaping may be required at all.
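A minimal sketch of the escaping point, using Python's standard `xml.etree.ElementTree` as a stand-in for "an XML parser" (the helper name `parses_as_xml` and the sample scripts are illustrative assumptions, not part of any validator). It only checks XML well-formedness; it says nothing about how an HTML parser treats the same bytes.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if an XML parser accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A script that only uses ">" needs no escaping to be well-formed XML.
only_gt = "<script>if (a > b) { a = 0; }</script>"
# A bare "<" inside the script is a well-formedness error in XML.
bare_lt = "<script>if (a < b) { a = 0; }</script>"
# Wrapping the body in a commented-out CDATA section makes it acceptable again.
cdata_wrapped = "<script>/* <![CDATA[ */ if (a < b) { a = 0; } /* ]]> */</script>"

for sample in (only_gt, bare_lt, cdata_wrapped):
    print(parses_as_xml(sample), sample)
```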

Per Henri’s questions, it might be the case that the people Jeffrey Zeldman represents do NOT care about flagging &nbsp;.  Or they might wish to continue to use the XHTML 1.1 DOCTYPE but wish to have the added benefits a polyglot validator could provide given that they will be serving the content as text/html to at least some user agents.
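To make the &nbsp; question concrete, here is a small sketch under the assumption that the XML side is parsed without a DTD that defines the entity (which is what Python's `xml.etree.ElementTree` does here). The helper name `xml_error` is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def xml_error(markup: str):
    """Return the XML parser's error message, or None if the markup parses."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as exc:
        return str(exc)


named = "<p>a&nbsp;b</p>"     # named entity, not one of XML's five predefined ones
numeric = "<p>a&#160;b</p>"   # numeric character reference for the same character

print(xml_error(named))    # an error message: the entity is undefined in plain XML
print(xml_error(numeric))  # None: numeric references are fine in both syntaxes

# On the HTML side the named reference resolves to U+00A0.
print(html.unescape("a&nbsp;b") == "a\u00a0b")
```

Whether a validator should flag the first form is exactly the judgment call discussed here; the sketch only shows why the question arises.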

I continue to suggest that this is not a problem with easy answers that solve the problem 100%, but I do believe that there are easy solutions to the 80% cases, and that asymptotically approaching the 100% case is something I am willing to collaborate on.  All I ask is for vetted test cases.

Posted by Sam Ruby at

There exist techniques which, while parsed differently, will parse differently in ways that don’t matter.

Right, yeah. That’s what I had meant to express.

It is presumably trivial to validate a document as HTML, then flip a switch and re-validate it as XHTML. I assume that what someone who would go to the trouble of doing that would really like to know is whether the two documents would behave the same (for some definition of "same").

Thus, e.g., it would be useful to flag <table>s without explicit <tbody> elements (these behave differently), but not so useful to flag escaped inline javascript that yields a different parse tree, but which behaves the same.
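One way the `<tbody>` check might look on the XML side, as a rough sketch: parse the document as XML and count tables that have a `<tr>` as a direct child, since those are the tables where an HTML parser would insert an implied `<tbody>`. The function name `tables_missing_tbody` is an assumption for illustration, and the DOCTYPE is left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def tables_missing_tbody(document: str) -> int:
    """Count <table> elements that have a <tr> as a direct child."""
    root = ET.fromstring(document)
    count = 0
    for table in root.iter(XHTML + "table"):
        if any(child.tag == XHTML + "tr" for child in table):
            count += 1
    return count


doc = """<html xmlns='http://www.w3.org/1999/xhtml'>
<head><title>Implied Tbody Elements</title></head>
<body><table><tr><td>green</td></tr></table></body>
</html>"""

# The same document with the tbody written out explicitly.
explicit = doc.replace("<table><tr>", "<table><tbody><tr>").replace(
    "</tr></table>", "</tr></tbody></table>"
)

print(tables_missing_tbody(doc))
print(tables_missing_tbody(explicit))
```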

Posted by Jacques Distler at

It is presumably trivial to validate a document as HTML, then flip a switch and re-validate it as XHTML.

I haven’t looked closely at Henri’s code, but I don’t presume anything.  Validating that the same stream of bytes conforms to HTML5 and conforms to XHTML5 doesn’t ensure that both would result in the same DOM.

Parsing it twice and comparing the DOMs would do that, but typical XML parsers would stop at the first unquoted attribute.
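A quick illustration of that limitation, assuming Python's standard library parsers as stand-ins: `xml.etree.ElementTree` for a typical XML parser and `html.parser` for a tolerant HTML tokenizer (it is not an HTML5 tree builder). The class name `HrefCollector` is made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

markup = "<p><a href=one>first</a> <a href=two>second</a></p>"

# The XML parser gives up at the first unquoted attribute, so the second
# one is never reported.
try:
    ET.fromstring(markup)
    error_position = None
except ET.ParseError as exc:
    error_position = exc.position  # (line, column) of the first fatal error


class HrefCollector(HTMLParser):
    """Collect href values; the tolerant tokenizer sees every attribute."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        self.hrefs.extend(value for name, value in attrs if name == "href")


collector = HrefCollector()
collector.feed(markup)
collector.close()

print(error_position)
print(collector.hrefs)
```

This is why a single pass over the HTML token stream is attractive: it can report every unquoted attribute rather than only the first.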

Doing a deep analysis of what differences matter is something I’m looking for help with.  And the way I would prefer to proceed is with a set of smallish documents which should be flagged by the validator for whatever reason.  If I can manage to make that happen with one pass (like I apparently can on unquoted attributes), so much the better.

Posted by Sam Ruby at

Sam,
Thank you for volunteering to help with the validator/validation request. Very much appreciated, and I think your methodology of requesting specific test cases illustrative of specific rules to be checked is very pragmatic. I’m optimistic that implementation accordingly is likely to satisfy the request.  There are good questions in your post and the comments. Watch the following for updates:

Super Friends Guide to HTML5 Hiccups: Validation of XHTML and HTML

-Tantek

Posted by Tantek at

Probably, your comment section is not the place for testcases, but, in the spirit of getting the ball rolling, here are two, anyway:

<!DOCTYPE html>
<html xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>Implied Tbody Elements</title>
<style>tbody {color:green}</style>
</head>
<body>
<table><tr><td>green</td></tr></table>
<p>Should be flagged:</p>
<pre>Warning: Implied <code>&lt;tbody&gt;</code> element, present in HTML,
will be absent, when parsed as XHTML.</pre>
</body>
</html>

and

<!DOCTYPE html>
<html xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>Inline Scripts</title>
<script type="text/javascript">
/* <![CDATA[ */
 a = 0;
/* ]]> */
</script> 
</head>
<body>
<p>Should not be flagged.</p>
</body>
</html>
Posted by Jacques Distler at

That was the goal of the html5-xhtml5 document to explore what could be compatible.

Posted by karl dubost at

Some more tests along these lines.

Posted by Jacques Distler at

It’s nice to see Jeffrey Zeldman looking out for my interests.

The details of the Super Friends request are not clear enough at this time to be able to tell if Zeldman is really looking out for your interest.

It seems to me that only you and Sam actually have use cases for a polyglot validator. A user base of two is rather small, though. I haven’t noticed the Super Friends doing dual MIME type publishing.

<!DOCTYPE html>
<html xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>Inline Scripts</title>
<script type="text/javascript">

/* <![CDATA[ */
 a = 0;
/* ]]> */
</script> 
</head>
<body>
<p>Should not be flagged.</p>
</body>
</html>

Note that this document produces different DOMs as text/html and application/xhtml+xml. In text/html, the line break between </body> and </html> gets hoisted into the content of body. In application/xhtml+xml, the line break ends up in a text node sibling following the body node.
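The XML half of that observation can be seen with Python's `xml.etree.ElementTree`, which stores text that follows an element in that element's `tail`. This is only a sketch of the XML side; the standard library has no HTML5 tree builder, so the text/html behaviour is not reproduced here.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

doc = (
    "<html xmlns='http://www.w3.org/1999/xhtml'>\n"
    "<head><title>Inline Scripts</title></head>\n"
    "<body>\n<p>Should not be flagged.</p>\n</body>\n"
    "</html>"
)

root = ET.fromstring(doc)
body = root.find(XHTML + "body")

# The line break between </body> and </html> follows the body element
# as a sibling text node rather than living inside it.
print(repr(body.tail))

# The line break before </body> stays inside body, after the paragraph.
print(repr(body[-1].tail))
```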

(Sam: It seems that your comment system doesn’t deal with putting a pre inside a blockquote.)

Posted by Henri Sivonen at

Note that if polyglot checking is taken to its logical extreme, it is impossible to construct an (X)HTML5 polyglot document due to xmlns="http://www.w3.org/1999/xhtml". Presumably, it is implied that at least on that point true tree sameness be relaxed.
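A rough illustration of the `xmlns` point, with the caveat that Python's `html.parser` is only a tokenizer standing in for the HTML side and `xml.etree.ElementTree` is one particular namespace-aware XML API. The class name `FirstTag` is hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

markup = (
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<head><title>t</title></head><body></body></html>"
)

# Namespace-aware XML view: xmlns is consumed as a namespace declaration.
root = ET.fromstring(markup)
xml_view = (root.tag, dict(root.attrib))


class FirstTag(HTMLParser):
    """Record the first start tag and its attributes."""

    def __init__(self):
        super().__init__()
        self.first = None

    def handle_starttag(self, tag, attrs):
        if self.first is None:
            self.first = (tag, dict(attrs))


parser = FirstTag()
parser.feed(markup)
parser.close()
html_view = parser.first  # HTML tokenizer view: xmlns is just another attribute

print(xml_view)
print(html_view)
```

Any tree comparison would need to treat this difference as expected rather than as something to flag.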

Posted by Henri Sivonen at

The details of the Super Friends request are not clear enough at this time to be able to tell if Zeldman is really looking out for your interest.

Your Validator already provides the option of toggling between validating a given stream of bytes as HTML and as XHTML. So I presume that the Super Friends either

a) have not looked at your Validator or
b) want something more than the knowledge that a given stream of bytes yields a valid HTML/XHTML document.

You’re right that they have not clearly articulated what else it is that they want. So, perhaps I have misinterpreted ...

It seems to me that only you and Sam actually have use cases for a polyglot validator. A user base of two is rather small, though.

Every user of Instiki is, by default, a polyglot publisher. (The default Markdown markup is served in polyglot fashion; Textile (the old default) is served only as text/html).

Now, it’s true that

a) aside from its support for <audio> and <video>, Instiki does not currently pretend to produce HTML5.
b) most users couldn’t care less about validity, and so would have no use for a Polyglot Validator.

But those caveats probably apply more widely ...

Posted by Jacques Distler at

It seems to me that only you and Sam actually have use cases for a polyglot validator

I will add Rails and Wordpress and others to that list.  Henri is technically correct that while neither need to put trailing slashes on meta, link, br and other elements, they both chose to do so.  They also chose to consistently quote all attributes, and avoid the shortcuts of omitting closing elements.  What reasons they might have for doing all of these doesn’t much matter to me.  Easier to learn, less to remember, talismans, superstition, fashion, aesthetics: all of these are possible reasons, and all are debatable.  What is clear to me, however, is that the demand is there.

I also sense that as we explore this there will be differences in opinion as to which things should be flagged.  People seem to be split over whether &nbsp; should be flagged.  I don’t care too much about the missing tbody issue, but others do.

And, predictably, some view all this as a waste of time.  That’s OK too.

Posted by Sam Ruby at

I don’t care too much about the missing tbody issue, but others do.

If you do any DOM manipulation then firstChild/lastChild constructs are going to fall down on issues like that (and, for that matter, on the whitespace text-node issues that Henri mentioned).

To my mind, at least, what a polyglot validator could offer (over and above the existing ability of Henri’s Validator to validate a stream of bytes as either HTML or XHTML) is the ability to catch such differences in the DOM.

Things will get even more fun when browsers start to support MathML+SVG-in-HTML.

I will add Rails and Wordpress and others to that list.

In my copious free time, I hope to eventually produce a polyglot variant of Melody.

Posted by Jacques Distler at

If you do any DOM manipulation then firstChild/lastChild constructs are going to fall down

Methinks that “any” is a wee bit too strong of a word to use here. :-)

Which is the crux of the matter.  As Henri points out, you can’t even get past the <html> element without producing a DOM that has some difference between what an HTML5 parser will produce and what a namespace aware XML parser will produce.  So, the next question is what differences matter?

You, yourself, posted a test case involving commented scripts where there will be differences that may be relevant to DOM manipulating scripts.  And posted it just after a test case that will produce differences that are only relevant to DOM manipulating scripts (and possibly some CSS rules).

Net: which differences are important turns out to be a judgment call.  And I’m entirely OK with that, just so long as it isn’t me that has to make (and defend!) the individual choices.  :-)

If nothing else, I expect that flagging of implicitly closed elements will be popular.

Posted by Sam Ruby at

Indeed, “any” is a little too strong. But I certainly have gotten tripped up by such DOM differences (and I, personally, do very little DOM scripting).

In a similar vein, even whitespace normalization in attribute values can lead to startlingly different behaviours.
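As a small sketch of the attribute whitespace issue: XML attribute-value normalization turns a literal line break inside an attribute value into a space, while a tolerant HTML tokenizer keeps it. The example assumes Python's `xml.etree.ElementTree` and `html.parser`; the class name `TitleGrabber` is made up.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

markup = '<p title="line one\nline two">x</p>'

# XML: the newline inside the attribute value is normalized to a space.
xml_title = ET.fromstring(markup).get("title")


class TitleGrabber(HTMLParser):
    """Record the title attribute of the first start tag that has one."""

    def __init__(self):
        super().__init__()
        self.title = None

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "title" and self.title is None:
                self.title = value


grabber = TitleGrabber()
grabber.feed(markup)
grabber.close()
html_title = grabber.title  # HTML tokenizer: the newline is preserved

print(repr(xml_title))
print(repr(html_title))
```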

Aside from the assurance that “This stream of bytes would form a valid XHTML document, if I chose to send them as application/xhtml+xml, but I’m not going to do that, so it doesn’t really matter whether that document would behave as I expect it to.” what are we trying to assure?

I have a fairly good idea of what things are likely to screw me up (and, hence, which I would like to see flagged). I have less of an idea which things are likely to screw up Jeremy Zeldman (and, hence, which he would like to see flagged).

Posted by Jacques Distler at

what are we trying to assure?

I think people are trying to assure different things.  I think Jeffrey Zeldman believes that a profile of HTML5 that has considerably less variability is easier to teach.  For me, I would like to ensure that my content can be parsed by an XML parser for data mining reasons, even if the DOM is different (note: as HTML5 parsing libraries become more ubiquitous, this will be less of a concern for me).  Tantek would like differences which can lead to differences in behavior to be flagged, even though he knows that 100% is neither achievable nor desirable.

These goals are mostly aligned, though I strongly suspect that the “Super Friends” stated goal of a single toggle is not achievable.  In particular, I’m willing to bet that Jeffrey would prefer a mode that did not flag uses of &nbsp; in any way.  But at the moment, that’s just conjecture on my part.

For my part, I would not object either way to the inclusion or exclusion of the missing tbody test; and will go further and say that if it were included, I would change my content to conform (not necessarily a big deal as I rarely use tables anyway).

Posted by Sam Ruby at

Ooops: s/Jeremy/Jeffrey/ !

I think I’ve articulated the sorts of things that concern me. A whole 'nother set of issues will probably arise when it comes time to start trying to do MathML-and-SVG-in-HTML. Again, a polyglot validator would be very useful to deal with them.

I’m not sure why Jeffrey Z is dissatisfied with the existing facilities of Henri’s Validator. Perhaps he needs to articulate what more he would like to see (since you and I are just speculating about that).

Posted by Jacques Distler at

I did bring up this issue on the WHAT WG list a month ago.

My reasoning is explained on my blog: [link]

I initially proposed that a few XHTML-features should be conformance criteria. I wanted to test the boundaries of such a proposal.

However, working as an educator, I can affirm that there are huge benefits to applying the most common features of XHTML even in a text/html situation, and that being able to automatically check for those features would be a tremendous help. This is an experience I share with practically every single standards-minded educator with whom I’ve ever discussed this.

There is no evidence and no research that indicates the opposite. Henri Sivonen’s concerns about “poisoned minds” are simply not a big issue.

Simon Pieters did open up two bugs on the validator’s bug tracker on these issues:

[link]
[link]

Time permitting, I will partake in this discussion even more. Right now I am waiting for the “superfriends” to bring it to the lists, though.

Posted by Lars Gunther at

I initially proposed that a few XHTML-features should be conformance criteria. I wanted to test the boundaries of such a proposal.

So would using the (existing) option in Henri’s Validator, to validate the document as XHTML, suffice for your purposes, or would you require something more specialized (e.g. conformance to some subset of XHTML-like features, rather than full validity)? If the latter, is there some consensus as to which XHTML-like feature should be included?

Posted by Jacques Distler at

I will try to get back to the list with a fuller proposal, time permitting. But a thought I am having is that since HTML5 is introducing a lot of boolean attributes, especially for forms, it will be an awful lot of markup if one cannot use them in their short form. So in teaching my students it would be great to have an option to allow short boolean attributes while still enforcing quotation marks on all other attributes.

Now, since there is no consensus on this issue, other people have suggested that quotation marks should be checked for only on attributes that may take values where omitting them actually might break things. I.e. check for them on the alt or title attribute, but not on the width or height, on an image. Speaking as a teacher I think they are better off using quotation marks on all non-boolean attributes when they start out, and then as they get more skilled the checks could be relaxed a bit.

Choosing between two sub-optimal solutions: No enforcing of quotation marks and enforcing them even on boolean attributes, I’d opt for the latter. That seems to be doable in the validator as it is.

I have also written a lengthy reply on Zeldman’s blog: [link] (to comment)

Posted by Lars Gunther at

Choosing between two sub-optimal solutions: No enforcing of quotation marks and enforcing them even on boolean attributes, I’d opt for the latter. That seems to be doable in the validator as it is.

Don’t worry unduly about the implementation costs.  Dealing with an attribute without a value, and dealing with an attribute with an unquoted value are both separate code paths.  Flagging one or both or neither are all available options.

I continue to be skeptical that a single “toggle” would be sufficient.  If there is a demand for an “education” profile (flagging, say, implicit insertion and closing of elements and attributes with values without quotes) and a separate “enterprise” profile (flagging all of the above as well as attributes without values and named entity references other than the five predefined values) then I’m OK with that.  Or any other combination or number of profiles.

As I said, these lines are subjective, so it wouldn’t be surprising or troublesome to me if Tantek’s requirements go beyond yours.

Posted by Sam Ruby at

Sam, thanks for your leadership and volunteerism.

Zeldman is really looking out for your interest.

Just wanted to point out that the HTML5 Super Friends is a group effort.

Posted by zeldman at

Don’t worry unduly about the implementation costs.

validator.diff and htmlparser.diff are patches that will allow the user to select a “profile” (currently “permissive”, “pedagogical”, and “polyglot”, where “permissive” is whatever the validator.nu reports now, “pedagogical” adds warning messages for attributes with values but no quotes, and “polyglot” adds warning messages for attributes with no values.)
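To give a feel for how such profiles could behave, here is a toy sketch in Python. It is not the patch described above (that lives in validator.diff and htmlparser.diff, which are not reproduced here); the class `ProfileChecker`, the function `check`, the regular expression, and the message wording are all assumptions for illustration, and it assumes the “polyglot” warnings are added on top of the “pedagogical” ones.

```python
import re
from html.parser import HTMLParser

PROFILES = ("permissive", "pedagogical", "polyglot")

# Attribute name, optionally followed by "=" and a quoted or unquoted value.
ATTR = re.compile(r"""\s+([^\s=/>]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?""")


class ProfileChecker(HTMLParser):
    """Collect warnings for attribute syntax according to a named profile."""

    def __init__(self, profile: str):
        super().__init__()
        if profile not in PROFILES:
            raise ValueError(f"unknown profile: {profile}")
        self.profile = profile
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if self.profile == "permissive":
            return  # report nothing beyond what plain validation would
        raw = self.get_starttag_text()
        # Skip "<" plus the tag name, then look at each attribute as written.
        for name, value in ATTR.findall(raw[1 + len(tag):]):
            if value == "":
                # Attribute with no value at all, e.g. <input checked>.
                if self.profile == "polyglot":
                    self.warnings.append(f'attribute "{name}" has no value')
            elif value[0] not in "\"'":
                # Attribute with a value but no quotes, e.g. type=checkbox.
                self.warnings.append(f'attribute "{name}" has an unquoted value')


def check(markup: str, profile: str) -> list:
    """Run the checker over the markup and return its warnings."""
    checker = ProfileChecker(profile)
    checker.feed(markup)
    checker.close()
    return checker.warnings


sample = '<input type=checkbox name="a" checked>'
for profile in PROFILES:
    print(profile, check(sample, profile))
```

The two conditions sit on separate branches, which mirrors the earlier remark that flagging one, both, or neither are all available options.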

Posted by Sam Ruby at

HTML vs XHTML is an excellent source for ideas...

Posted by Sam Ruby at

Sam Ruby: Polyglot Validation (Wed 02 Sep 2009 at 14:30): The HTML5 Super Friends: The spec should clarify that an author can use XHTML or HTML syntax, that it is a coding style preference. It would be great if Henri could add a togg...

Excerpt from vantguarde / HTML (918) at

HTML5 watch

Keeping up with HTML5 can seem like a full-time job if you’re subscribed to both the W3C public-html list and the WHATWG mailing list. If you have to choose just one, the WHATWG list is definitely the red pill. The W3C list has a very high volume...

Excerpt from Adactio at
