It’s just data

Humpty Dumpty

Steven Pemberton: I’m not arguing about processing, I’m arguing about the document. And as far as I am concerned, when it comes to saying what sort of document it is, the author is most certainly normative.

I must say that this discussion reminds me of this passage in Through the Looking-Glass:

‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.’

‘The question is,’ said Alice, ‘whether you can make words mean so many different things.’

‘The question is,’ said Humpty Dumpty, ‘which is to be master -- that’s all.’

So, while I doubt that I will ever understand why there are people who insist on calling the pages they produce with the intention of being processed as HTML by the name “XHTML”, I can’t deny that there clearly is something that such people want.  (By way of comparison, I am quite happy to say that this page is served as XHTML to browsers that support such, and as HTML to browsers that don’t.  From my experience, this is tricky stuff, and not something that should be recommended lightly.)

But in any case, one such thing that apparently some people want is to produce what they perceive as cleaner markup.  Cleaner is clearly in the eye of the beholder, but for some people that means things like quoting attributes.  In fact, that is something that HTML 4.01 explicitly recommended, a recommendation that, to date, has not carried forward to HTML5.  The reasons people might want to follow this recommendation vary, but for some the intent is to produce polyglot documents.

A conformance checker can assist such people, and to that end I’ve developed a patch to Validator.nu’s Tokenizer that is totally unofficial, experimental, and subject to change.

===================================================================
--- src/nu/validator/htmlparser/impl/Tokenizer.java	(revision 574)
+++ src/nu/validator/htmlparser/impl/Tokenizer.java	(working copy)
@@ -1193,6 +1193,7 @@
     }
 
     private void addAttributeWithoutValue() throws SAXException {
+        warn("Unquoted attribute value.");
         // [NOCPP[
         if (metaBoundaryPassed && AttributeName.CHARSET == attributeName
                 && ElementName.META == tagName) {
@@ -1878,6 +1879,7 @@
                                  */
                                 clearLongStrBuf();
                                 state = Tokenizer.ATTRIBUTE_VALUE_UNQUOTED;
+                                warn("Unquoted attribute value.");
                                 reconsume = true;
                                 continue stateloop;
                             case '\'':
@@ -1933,6 +1935,7 @@
                                  */
 
                                 state = Tokenizer.ATTRIBUTE_VALUE_UNQUOTED;
+                                warn("Unquoted attribute value.");
                                 continue stateloop;
                         }
                     }

At my request, Mike Smith has deployed this patch (again: unofficially, experimentally, and subject to change) on the qa-dev site.  Feel free to experiment with it.
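
For readers who want to try the idea without building Validator.nu, here is a minimal standalone sketch of the same check. It is not the Validator.nu tokenizer: the class and method names are hypothetical, it looks at a single start tag passed in as a string, and it ignores comments, scripts, and character references. Like the patch, it reports values that are not wrapped in single or double quotes; unlike the patch, it stays silent on attributes that have no value at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Validator.nu.
final class UnquotedAttributeLint {

    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
    }

    /** Returns the names of attributes in one start tag whose values are unquoted. */
    static List<String> unquotedAttributes(String startTag) {
        List<String> found = new ArrayList<>();
        int n = startTag.length();
        int i = 1; // skip the leading '<'
        // Skip the tag name.
        while (i < n && !isSpace(startTag.charAt(i))
                && startTag.charAt(i) != '>' && startTag.charAt(i) != '/') {
            i++;
        }
        while (i < n) {
            // Skip whitespace and any '/' before the next attribute.
            while (i < n && (isSpace(startTag.charAt(i)) || startTag.charAt(i) == '/')) {
                i++;
            }
            if (i >= n || startTag.charAt(i) == '>') {
                break;
            }
            // Read the attribute name.
            int nameStart = i;
            while (i < n && !isSpace(startTag.charAt(i)) && startTag.charAt(i) != '='
                    && startTag.charAt(i) != '>' && startTag.charAt(i) != '/') {
                i++;
            }
            String name = startTag.substring(nameStart, i);
            while (i < n && isSpace(startTag.charAt(i))) {
                i++;
            }
            if (i >= n || startTag.charAt(i) != '=') {
                continue; // attribute without a value, e.g. "disabled"
            }
            i++; // skip '='
            while (i < n && isSpace(startTag.charAt(i))) {
                i++;
            }
            if (i >= n || startTag.charAt(i) == '>') {
                break; // "name=" with nothing after it
            }
            char q = startTag.charAt(i);
            if (q == '"' || q == '\'') {
                // Quoted value: jump past the matching quote.
                int close = startTag.indexOf(q, i + 1);
                i = (close < 0) ? n : close + 1;
            } else {
                // Unquoted value: runs until whitespace or '>'.
                found.add(name);
                while (i < n && !isSpace(startTag.charAt(i)) && startTag.charAt(i) != '>') {
                    i++;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String tag = "<a href=foo title=\"x\" id='y' disabled class=z>";
        System.out.println(unquotedAttributes(tag)); // prints [href, class]
    }
}
```

A real lint pass would hook into a full tokenizer, as the patch does, so that attribute-like text inside comments or script content is not misreported.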

I’m not certain how to proceed with this.  Should this be abandoned?  Should this be an option?  Should this be a completely separate validator?

If this continues (and I stress if), what other “best practices” would people find it helpful to identify?


The problem with doing this, as far as I can see, is that we have enough problems deciding what should be conforming (due to valid technical reasons against things), let alone what “best practices” are. I can’t see such a discussion becoming anything but another permathread that spends its life going around in circles.

Posted by Geoffrey Sneddon at

Well...

IMHO such an option in the validator is definitely a step in the right direction!

Closing all elements (XHTML style) has been a “best practice” for a long time now in my neck of the woods.
As has never omitting the tags of mandatory elements with optional start and end tags (tbody, html, head, body, li, p and the rest of them).

Using XHTML made that practice really simple to QA and enforce in most cases - non validation meant non conformant / rejected markup. Simple as pie (text/html versus application/xhtml+xml issues aside of course - but the practice was really easy to implement using the validator as a required tool and has really helped enforcing coding styles over the years)

IMHO - If you guys really are serious about giving us authors the option to continue to author our documents using “XHTML syntax” in HTML5 - an option to instruct the validator.nu validator to throw up on “non-well-formedness according to XML or non-lowercase tagnames” regardless of mime-type (you know what I mean) would be really helpful as a Q/A tool.

Posted by Jarvklo at

while I doubt that I will ever understand why there are people who insist on calling the pages they produce with the intention of being processed as HTML by the name “XHTML”

You have an HTTP view of it (understanding of protocol). Other people have a document view of it (no understanding of protocol).

There is the same kind of divide between X/HTML integrators (those who define the templates), who understand markup, and artistic directors (those who define the layout), who do not.

Posted by karl dubost at

IMHO, it seems that people don’t want to check if their code conforms to the specification, but rather to their own coding style. That could even include things like using spaces instead of tabs or starting every new paragraph on a new line in the source. That’s why I believe that this shouldn’t really be a job for a validator, but rather for (heavily customizable) lint-like software.

Posted by Ms2ger at

About the mime type thing: People who don’t understand that thing should be told to try sending a text file containing nothing but html markup with mime type text/plain to their browser. When they see the naked markup on their screen like any plain text file they occasionally see while surfing the web, they would start to understand that the mime type is all that matters with any text file.

The only reason they don’t get it in the case of text/html for XHTML is that they can’t see the difference just by looking at their screen.

Posted by Gerenco at

I have a feeling the reason a lot of people want to produce XHTML rather than HTML is that they have a tool they like that understands XML but doesn’t understand HTML-that-is-not-XML. For example, I never found an HTML authoring tool that I was happy with until I found nxml-mode. I author in XHTML not because I care about its prettier markup, but because I can use nxml-mode to produce it. Authoring tools aside, I’m sure many people have some link in their publishing toolchain that expects XML, and so it’s handy to use an XML flavor of HTML.

Posted by Ryan Shaw at

Ryan: if you’d like to edit XHTML5 in nxml-mode with validation, give this a try.

Posted by Edward O'Connor at

Geoffrey: I’m not asking for consensus, I’m wondering if there is a set of people who find this useful.

Jarvklo: What I will ultimately need is concrete feedback in the form of HTML fragments that people would like to see messages produced for.

karl: the bytes on this page when interpreted as a document are valid XHTML5.  The result is also valid HTML5.  The difference is not in how these bytes are authored, but in how they are processed.  As a specific example, XHTML and HTML interpret <title> differently.  I’ve happened to avoid such issues, but I know of no tool that helps people identify such differences that would affect their content.
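
To make the <title> difference concrete, here is a small sketch of the XML half, using only the JDK’s built-in XML parser. The class and method names are hypothetical. Fed the bytes `<title>A <b>bold</b> title</title>`, an XML parser sees a title element with a `b` child element; an HTML parser treats the content of title as text, so the same bytes yield a title whose text is the literal string `A <b>bold</b> title`. The JDK has no HTML5 parser, so only the XML side is executed here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows how an XML parser reads markup inside <title>.
final class TitleAsXml {

    private static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    /** Number of element children of the root element. */
    static int childElements(String xml) {
        NodeList children = parseRoot(xml).getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    /** Text content of the root element, with markup stripped by the XML parser. */
    static String textContent(String xml) {
        return parseRoot(xml).getTextContent();
    }

    public static void main(String[] args) {
        String bytes = "<title>A <b>bold</b> title</title>";
        System.out.println(childElements(bytes)); // prints 1
        System.out.println(textContent(bytes));   // prints A bold title
    }
}
```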

Ms2ger: validator.nu is both available online, and can be downloaded and run locally.  If you wish to call it a lint tool that’s fine with me.  I’m interested in knowing what customizations people are interested in.

Posted by Sam Ruby at

“Save as” this page. Open it in an authoring tool on my laptop, such as TextMate. This page is what the author says it is not how it has been sent.

I understand your point of view, because it seems that your document ceases to exist after the browser. What Steven seems to promote is that the document exists before the Web and after the Web (such as on filesystems).

Steven’s position is even reinforced by the fact that (follow me)

For example, generations of people have used FTP to put (X)HTML pages online. Many CMSs do not follow Web principles. Many people do not manage URIs. They do not understand HTTP either. (btw, that doesn’t make me happy, but if we talk about "real world"®, that has to be taken into account.)

People write a document in a language. Then put this document online (accessible through the Web) as they do for a JPEG image, a flash, etc. The fact that it is viewable in a rendering engine which is called a browser does not matter that much.

Posted by karl dubost at

This page is what the author says it is not how it has been sent.

Pray tell, what does "the author say it is"?  As the author of this page, I must say that I really am curious...

Posted by Sam Ruby at

I had exactly the same discussion today with a front-end Web developer where I work (Web agency). He was asking me “What should I use Karl, HTML 4 or XHTML 1?”

I replied: “what are you more comfortable with?”

He said: “I tried HTML 4 Doctype and I got 38 validation errors because I had all my tags closed as I do for XHTML 1. I don’t like that. So I put back the XHTML 1 doctype. I prefer to write XML, with my tags closed.”

I said: “You know browsers don’t bother with the doctype. The browser processes the page depending on the MIME type sent by the server. Though you can just put the HTML5 doctype like this… (me writing it)… and use the experimental mode of the validator to check your markup.”

He replied: “Aaah so short? ok ok. As long as I can write XML pages.”

I said: “but you know it will not be processed as XML.”

He said: “Yes sure, but the syntax is XML and it looks the same in the browser.”

Posted by karl dubost at

Oops, I forgot to reply to your question:

Pray tell, what does "the author say it is"?  As the author of this page, I must say that I really am curious...

According to the developer above, you close your empty tags. You use double quotes, your elements are balanced: XML!

Posted by karl dubost at

Sam Ruby: Humpty Dumpty

How boring...

Excerpt from はてなブックマーク - 徒栞 - お気に入り at

Well, if warnings for unquoted attribute values are done, I’d be interested in warnings for tab characters too. I personally always use spaces. I also have a certain indentation style that I try to apply consistently and modify somewhat over time... If that could be checked that’d be neat too. (Trying to answer your question about what else I’d like a “validator” to check.)

Posted by Anne van Kesteren at

Here’s a patch (to be applied after the previous patch) to issue warnings on the presence of tab characters when in the “DATA” state.  This is the most common case, but there are, of course, a number of other states that should be considered (in particular, RCDATA and CDATA).  A proper approach would be to produce test cases for each, and then to implement the changes necessary to get the test cases to pass.

===================================================================
--- Tokenizer.java	(revision 574)
+++ Tokenizer.java	(working copy)
@@ -1427,6 +1428,9 @@
                             case '\r':
                                 emitCarriageReturn(buf, pos);
                                 break stateloop;
+                            case '\t':
+                                warn("Tab character.");
+                                continue;
                             case '\n':
                                 silentLineFeed();
                             default:

If you would like to experiment with it, Mike Smith has deployed this patch (again: unofficially, experimentally, and subject to change) on the qa-dev site.
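
For anyone who only wants the tab check and not the rest of the validator, a minimal sketch follows. It is a plain character scan with hypothetical names, not the tokenizer patch: it reports every tab in the input, including tabs inside tags, comments, and attribute values, whereas the patch above only covers the “DATA” state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Validator.nu.
final class TabLint {

    /** Returns one warning per tab character, with 1-based line and column. */
    static List<String> findTabs(String source) {
        List<String> warnings = new ArrayList<>();
        int line = 1;
        int column = 0;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\n') {
                line++;
                column = 0;
                continue;
            }
            if (c == '\r') {
                // Treat CR LF as a single line break.
                if (i + 1 < source.length() && source.charAt(i + 1) == '\n') {
                    i++;
                }
                line++;
                column = 0;
                continue;
            }
            column++;
            if (c == '\t') {
                warnings.add("Tab character at line " + line
                        + ", column " + column + ".");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        for (String warning : findTabs("<p>\n\tindented with a tab</p>\n")) {
            System.out.println(warning); // prints Tab character at line 2, column 1.
        }
    }
}
```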

Posted by Sam Ruby at

Possible warnings:

Posted by karl dubost at

The body of your post has some incomplete markup. To wit: “[http://wiki.whatwg.org/wiki/HTML_vs._XHTML tricky stuff”

Also, Anne was surely just taking the piss in his comment, not asking for a patch? Surely?

Posted by Phil Wilson at

re: incomplete markup: fixed.  Thanks!

Posted by Sam Ruby at

I did not mean to be sarcastic. I’m not sure these should be syntax requirements, but I would like tools to check for them so I do not have to look for these things myself. Another thing that nobody seems to have mentioned so far is that the HTML syntax allows uppercase tag and attribute names where most people prefer all lowercase nowadays, I think. Might be another thing to check.

Posted by Anne van Kesteren at

I’m trying to move back to HTML4 from XHTML1.1 and my major problem was precisely letting go of nxml-mode.  However, I found the old Emacs SGML/psgml modes are actually quite good, and with some coercion can duplicate everything I used nxml for.

As for the quotes, I’ll have to agree with the first poster.  I think conforming to the specification is already too much stuff to keep in mind; I don’t want to deal with that, plus best practices.  If you guys like quotes, just put them in the spec; if not, make them entirely optional.  I don’t like the state of optional-but-not-quite.

Posted by Leonardo Boiko at
