It’s just data

Humpty Dumpty

Steven Pemberton: I’m not arguing about processing, I’m arguing about the document. And as far as I am concerned, when it comes to saying what sort of document it is, the author is most certainly normative.

I must say that this discussion reminds me of this passage in Through the Looking-Glass:

‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.’

‘The question is,’ said Alice, ‘whether you can make words mean so many different things.’

‘The question is,’ said Humpty Dumpty, ‘which is to be master -- that’s all.’

So, while I doubt that I will ever understand why some people insist on giving the name “XHTML” to pages they produce with the intention of having them processed as HTML, I can’t deny that there clearly is something that such people want.  (By way of comparison, I am quite happy to say that this page is served as XHTML to browsers that support it, and as HTML to browsers that don’t.  From my experience, this is tricky stuff, and not something that should be recommended lightly.)

But in any case, one thing that some people apparently want is to produce what they perceive as cleaner markup.  Cleaner is clearly in the eye of the beholder, but for some people that means things like quoting attribute values.  In fact, that is something that HTML 4.01 explicitly recommended, a recommendation that, to date, has not carried forward to HTML5.  The reasons people might want to follow this recommendation vary, but for some the intent is to produce polyglot documents.

A conformance checker can assist such people, and to that end I’ve developed a patch to the Tokenizer that is totally unofficial, experimental, and subject to change.

--- src/nu/validator/htmlparser/impl/	(revision 574)
+++ src/nu/validator/htmlparser/impl/	(working copy)
@@ -1193,6 +1193,7 @@
     private void addAttributeWithoutValue() throws SAXException {
+        warn("Unquoted attribute value.");
         // [NOCPP[
         if (metaBoundaryPassed && AttributeName.CHARSET == attributeName
                 && ElementName.META == tagName) {
@@ -1878,6 +1879,7 @@
                                 state = Tokenizer.ATTRIBUTE_VALUE_UNQUOTED;
+                                warn("Unquoted attribute value.");
                                 reconsume = true;
                                 continue stateloop;
                             case '\'':
@@ -1933,6 +1935,7 @@
                                 state = Tokenizer.ATTRIBUTE_VALUE_UNQUOTED;
+                                warn("Unquoted attribute value.");
                                 continue stateloop;

At my request, Mike Smith has deployed this patch (again: unofficially, experimentally, and subject to change) on the qa-dev site.  Feel free to experiment with it.
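
For anyone who would rather see the idea in isolation than read it out of the diff, here is a minimal sketch of the same kind of check.  It is not the validator's code: the class name `UnquotedAttributeSketch`, the method `findUnquoted`, and the handful of states are all hypothetical, and the scanner is deliberately simplified (no comments, no script content, no whitespace before the `=`).  It only flags values that are present but unquoted.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy scanner (hypothetical, not part of any validator) that reports unquoted attribute values. */
final class UnquotedAttributeSketch {

    // A reduced set of states, loosely modeled on how an HTML tokenizer walks a start tag.
    private enum State {
        DATA, TAG_NAME, BEFORE_ATTR_NAME, ATTR_NAME,
        BEFORE_ATTR_VALUE, VALUE_QUOTED, VALUE_UNQUOTED
    }

    /** Returns one warning per attribute whose value is present but not wrapped in quotes. */
    static List<String> findUnquoted(String html) {
        List<String> warnings = new ArrayList<>();
        StringBuilder name = new StringBuilder();
        State state = State.DATA;
        char quote = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case DATA:
                    // Only start tags matter here; "</" and stray "<" are ignored.
                    if (c == '<' && i + 1 < html.length()
                            && Character.isLetter(html.charAt(i + 1))) {
                        state = State.TAG_NAME;
                    }
                    break;
                case TAG_NAME:
                    if (Character.isWhitespace(c)) state = State.BEFORE_ATTR_NAME;
                    else if (c == '>') state = State.DATA;
                    break;
                case BEFORE_ATTR_NAME:
                    if (c == '>') {
                        state = State.DATA;
                    } else if (!Character.isWhitespace(c) && c != '/') {
                        name.setLength(0);
                        name.append(c);
                        state = State.ATTR_NAME;
                    }
                    break;
                case ATTR_NAME:
                    if (c == '=') state = State.BEFORE_ATTR_VALUE;
                    else if (c == '>') state = State.DATA;
                    else if (Character.isWhitespace(c) || c == '/') state = State.BEFORE_ATTR_NAME;
                    else name.append(c);
                    break;
                case BEFORE_ATTR_VALUE:
                    if (c == '"' || c == '\'') {
                        quote = c;
                        state = State.VALUE_QUOTED;
                    } else if (c == '>') {
                        state = State.DATA;
                    } else if (!Character.isWhitespace(c)) {
                        // This is the moment the sketch reports the unquoted value.
                        warnings.add("Unquoted attribute value: " + name);
                        state = State.VALUE_UNQUOTED;
                    }
                    break;
                case VALUE_QUOTED:
                    if (c == quote) state = State.BEFORE_ATTR_NAME;
                    break;
                case VALUE_UNQUOTED:
                    if (Character.isWhitespace(c)) state = State.BEFORE_ATTR_NAME;
                    else if (c == '>') state = State.DATA;
                    break;
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        String sample = "<a href=foo class=\"bar\" id='baz' title=qux>";
        for (String warning : findUnquoted(sample)) {
            System.out.println(warning);
        }
    }
}
```

Run against the sample in `main`, this reports `href` and `title` and stays silent about the two quoted attributes.  Issuing the warning at the transition into the unquoted-value state, rather than when the value ends, keeps the check to a single line per transition.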

I’m not certain how to proceed with this.  Should this be abandoned?  Should this be an option?  Should this be a completely separate validator?

If this continues (and I stress if), what other “best practices” would people find it helpful to identify?