Steven Pemberton: I’m not arguing about processing, I’m arguing about the document. And as far as I am concerned, when it comes to saying what sort of document it is, the author is most certainly normative.
‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.’
‘The question is,’ said Alice, ‘whether you can make words mean so many different things.’
‘The question is,’ said Humpty Dumpty, ‘which is to be master -- that’s all.’
So, while I doubt that I will ever understand why there are people who insist on calling pages they produce with the intention of being processed as HTML by the name “XHTML”, I can’t deny that there clearly is something that such people want. (By way of comparison, I am quite happy to say that this page is served as XHTML to browsers that support such, and as HTML to browsers that don’t. From my experience, this is tricky stuff, and not something that should be recommended lightly.)
But in any case, one such thing that apparently some people want is to produce what they perceive as cleaner markup. Cleaner is clearly in the eye of the beholder, but for some people that means things like quoting attributes. In fact, that is something that HTML 4.01 explicitly recommended, a recommendation that, to date, has not carried forward to HTML5. The reasons people might want to follow this recommendation vary, but for some the intent is to produce polyglot documents.
A conformance checker can assist such people, and to that end I’ve developed a patch to Validator.nu’s Tokenizer that is totally unofficial, experimental, and subject to change.
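To make the kind of check under discussion concrete, here is a minimal sketch in Python. It is not the Validator.nu patch; it only illustrates flagging attribute values written without quotes. The names ATTRIBUTE, UnquotedAttributeLint, and find_unquoted_attributes are hypothetical, and the regular expression is a rough approximation of attribute syntax, not the HTML5 tokenizer’s state machine.

```python
import re
from html.parser import HTMLParser

# Approximate attribute grammar (an assumption for this sketch): a name,
# optionally followed by "=" and a double-quoted, single-quoted, or
# unquoted value. Only the unquoted alternative fills group 2.
ATTRIBUTE = re.compile(
    r"""
    \s+([^\s"'<>/=]+)                # attribute name
    (?:\s*=\s*
        (?:"[^"]*"                   # double-quoted value
          |'[^']*'                   # single-quoted value
          |([^\s"'=<>`]+)            # group 2: unquoted value
        )
    )?
    """,
    re.VERBOSE,
)


class UnquotedAttributeLint(HTMLParser):
    """Collects (line, tag, attribute) for every unquoted attribute value."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        # The parsed attrs have already lost their quoting, so inspect
        # the raw text of the start tag instead.
        raw = self.get_starttag_text()
        line, _ = self.getpos()
        for match in ATTRIBUTE.finditer(raw):
            if match.group(2) is not None:
                self.warnings.append((line, tag, match.group(1)))


def find_unquoted_attributes(markup):
    lint = UnquotedAttributeLint()
    lint.feed(markup)
    lint.close()
    return lint.warnings
```

Calling find_unquoted_attributes on a fragment such as `<p class=note id="x">hi</p>` reports the `class` attribute and leaves the quoted `id` alone.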
At my request, Mike Smith has deployed this patch (again: unofficially, experimentally, and subject to change) on the qa-dev site. Feel free to experiment with it.
I’m not certain how to proceed with this. Should this be abandoned? Should this be an option? Should this be a completely separate validator?
If this continues (and I stress if), what other “best practices” would people find it helpful to identify?
The problem with doing this, as far as I can see, is that we have enough problems deciding what should be conforming (due to valid technical reasons against things), let alone what “best practices” are. I can’t see such a discussion becoming anything but another permathread that spends its life going around in circles.
IMHO such an option in the validator is definitely a step in the right direction!
Closing all elements (XHTML style) has been a “best practice” for a long time now in my neck of the woods.
As has never omitting the tags of elements with optional start and end tags (tbody, html, head, body, li, p and the rest of them).
Using XHTML made that practice really simple to QA and enforce in most cases - non validation meant non conformant / rejected markup. Simple as pie (text/html versus application/xhtml+xml issues aside of course - but the practice was really easy to implement using the validator as a required tool and has really helped enforcing coding styles over the years)
IMHO - If you guys really are serious about giving us authors the option to continue to author our documents using “XHTML syntax” in HTML5 - an option to instruct the validator.nu validator to throw up on “non-well-formedness according to XML or non-lowercase tagnames” regardless of mime-type (you know what I mean) would be really helpful as a Q/A tool.
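A rough sketch of what such a Q/A check could look like, written in Python purely for illustration. It ignores the MIME type entirely and simply asks whether the bytes are well-formed XML with lowercase element names. The function name xml_syntax_problems is hypothetical, this is not a validator.nu feature, and it checks element names only, not attribute names.

```python
import xml.etree.ElementTree as ET


def xml_syntax_problems(markup):
    """Return a list of problems; an empty list means the markup is
    well-formed XML whose element names are all lowercase."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        # The parser stops at the first well-formedness error it meets.
        return [f"not well-formed XML: {err}"]

    problems = []
    for element in root.iter():
        # Namespaced tags look like "{uri}name"; keep only the local name.
        local_name = element.tag.rpartition("}")[2]
        if local_name != local_name.lower():
            problems.append(f"element name is not lowercase: {local_name}")
    return problems
```

An unclosed `<br>` makes the whole document fail the well-formedness test, while `<BODY>` is reported as a naming problem. Anything an XML parser rejects without a DTD, such as HTML-only named entities, would also be reported here.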
while I doubt that I will ever understand why there are people who insist on calling their pages they produce with the intention of being processed as HTML by the name “XHTML”
You have an HTTP view of it (understanding of protocol). Other people have a document view of it (no understanding of protocol).
There is the same kind of divide between X/HTML integrators (those who define the templates and understand markup) and artistic directors (those who define the layout and do not understand markup).
IMHO, it seems that people don’t want to check if their code conforms to the specification, but rather to their own coding style. That could even include things like using spaces instead of tabs or starting every new paragraph on a new line in the source. That’s why I believe that this shouldn’t really be a job for a validator, but rather for (heavily customizable) lint-like software.
About the mime type thing: People who don’t understand that thing should be told to try sending a text file containing nothing but html markup with mime type text/plain to their browser. When they see the naked markup on their screen like any plain text file they occasionally see while surfing the web, they would start to understand that the mime type is all that matters with any text file.
The only reason they don’t get it in case of text/html for xhtml is that they can’t see the difference by just looking at their screen.
I have a feeling the reason a lot of people want to produce XHTML rather than HTML is that they have a tool they like that understands XML but doesn’t understand HTML-that-is-not-XML. For example, I never found an HTML authoring tool that I was happy with until I found nxml-mode. I author in XHTML not because I care about its prettier markup, but because I can use nxml-mode to produce it. Authoring tools aside, I’m sure many people have some link in their publishing toolchain that expects XML, and so it’s handy to use an XML flavor of HTML.
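The experiment described above can be tried locally with a small sketch like the following, which serves identical bytes under two different Content-Type headers. The paths /as-text and /as-html, the sample markup, and the names build_response, SameBytesHandler, and make_server are all made up for this demo.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# The same bytes are returned for both paths; only the label changes.
MARKUP = b"<!DOCTYPE html><title>Demo</title><p>Same bytes, different type.</p>"

# Hypothetical paths chosen for this demo.
CONTENT_TYPES = {
    "/as-text": "text/plain; charset=utf-8",
    "/as-html": "text/html; charset=utf-8",
}


def build_response(path):
    """Return (status, headers, body) for a request path."""
    content_type = CONTENT_TYPES.get(path)
    if content_type is None:
        return 404, {"Content-Length": "0"}, b""
    headers = {
        "Content-Type": content_type,
        "Content-Length": str(len(MARKUP)),
    }
    return 200, headers, MARKUP


class SameBytesHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers, body = build_response(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def make_server(port=0):
    # Port 0 asks the operating system for any free port.
    return HTTPServer(("127.0.0.1", port), SameBytesHandler)
```

Running `make_server(8000).serve_forever()` and visiting both paths in a browser should show raw markup for /as-text and a rendered paragraph for /as-html, even though the bytes on the wire are identical.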
Geoffrey: I’m not asking for consensus, I’m wondering if there are a set of people that find this useful.
Jarvklo: What I will ultimately need is concrete feedback in the form of HTML fragments that people would like to see messages produced for.
karl: the bytes on this page when interpreted as a document are valid XHTML5. The result is also valid HTML5. The difference is not in how these bytes are authored, but in how they are processed. As a specific example, XHTML and HTML interpret <title> differently. I’ve happened to avoid such issues, but I know of no tool that helps people identify such differences that would affect their content.
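As a small illustration of the <title> difference, the sketch below shows only the XML half: an XML parser turns markup inside <title> into child elements, whereas in the HTML syntax the contents of <title> are read as text, so the same tags would show up literally. The function name title_children is hypothetical, and this is a toy check rather than the missing tool mentioned above.

```python
import xml.etree.ElementTree as ET


def title_children(markup):
    """Parse markup as XML and return the local names of any elements
    found directly inside a <title> element."""
    root = ET.fromstring(markup)
    found = []
    for element in root.iter():
        # Namespaced tags look like "{uri}name"; compare local names only.
        if element.tag.rpartition("}")[2] != "title":
            continue
        for child in element:
            found.append(child.tag.rpartition("}")[2])
    return found
```

A non-empty result, for example from `<title>Fish &amp; <b>chips</b></title>`, flags a title whose content would differ between the two processing models.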
Ms2ger: validator.nu is both available online, and can be downloaded and run locally. If you wish to call it a lint tool that’s fine with me. I’m interested in knowing what customizations people are interested in.
“Save as” this page. Open it in an authoring tool on my laptop such as TextMate. This page is what the author says it is, not how it has been sent.
I understand your point of view, because it seems that your document ceases to exist after the browser. What Steven seems to promote is that the document exists before the Web and after the Web (such as on filesystems).
Steven’s position is even reinforced by the fact that (follow me) not many Web documents are created, but many documents are created outside of the Web and put on the Web.
Example: generations of people using ftp to put (x)html pages online. Many CMS do not follow Web principles. Many people do not manage URIs. They do not understand HTTP either. (btw, that doesn’t make me happy, but if we talk about "real world"®, that has to be taken into account.)
People write a document in a language. Then put this document online (accessible through the Web) as they do for a JPEG image, a flash, etc. The fact that it is viewable in a rendering engine which is called a browser does not matter that much.
I had exactly the same discussion today with a front-end Web developer where I work (Web agency). He was asking me “What should I use Karl, HTML 4 or XHTML 1?”
I replied: “what are you more comfortable with?”
He said: “I tried HTML 4 Doctype and I got 38 validation errors because I had all my tags closed as I do for XHTML 1. I don’t like that. So I put back the XHTML 1 doctype. I prefer to write XML, with my tags closed.”
I said: “You know browsers don’t bother with the doctype. The browser processes the page depending on the mime type sent by the server. So you can just put the html 5 doctype like this… (me writing it)… and use the experimental mode of the validator to check your markup.”
He replied: “Aaah so short? ok ok. As long as I can write XML pages.”
I said: “but you know it will not be processed as XML.”
He said: “Yes sure, but the syntax is XML and it looks the same in the browser.”
Well, if warnings for quoted attribute values are done, I’d be interested in warnings for tab characters too. I personally always use spaces. I also have a certain indentation style that I try to apply consistently and modify somewhat over time... If that could be checked that’d be neat too. (Trying to answer your question about what else I’d like a “validator” to check.)
Here’s a patch (to be applied after the previous patch) to issue warnings on the presence of tab characters when in the “DATA” state. This is the most common case, but there are, of course, a number of other states that should be considered (in particular, RCDATA and CDATA). A proper approach would be to produce test cases for each, and then to implement the changes necessary to get the test cases to pass.
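For readers who want to see the shape of such a check without reading the patch, here is a minimal sketch in Python. It is not the patch itself: it uses Python’s html.parser and only approximates “the DATA state” by skipping text inside script, style, textarea, and title. The names NON_DATA_ELEMENTS, TabLint, and find_tabs are hypothetical.

```python
from html.parser import HTMLParser

# Elements whose content an HTML tokenizer does not read in the DATA
# state; this sketch simply ignores text inside them.
NON_DATA_ELEMENTS = {"script", "style", "textarea", "title"}


class TabLint(HTMLParser):
    """Records the line numbers of tab characters in ordinary text."""

    def __init__(self):
        # Keep character references such as &#9; out of handle_data so
        # that only literal tab characters are counted.
        super().__init__(convert_charrefs=False)
        self.tab_lines = []
        self._skipping = None

    def handle_starttag(self, tag, attrs):
        if tag in NON_DATA_ELEMENTS:
            self._skipping = tag

    def handle_endtag(self, tag):
        if tag == self._skipping:
            self._skipping = None

    def handle_data(self, data):
        if self._skipping is not None:
            return
        line, _ = self.getpos()  # position where this run of text starts
        for char in data:
            if char == "\n":
                line += 1
            elif char == "\t" and (not self.tab_lines or self.tab_lines[-1] != line):
                self.tab_lines.append(line)  # report each line once


def find_tabs(markup):
    lint = TabLint()
    lint.feed(markup)
    lint.close()
    return lint.tab_lines
```

Tabs used as whitespace inside a tag, or inside a script block, are deliberately not reported, which mirrors the DATA-only scope described above; covering the other states would need separate handling and test cases.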
I did not mean to be sarcastic. I’m not sure these should be syntax requirements, but I would like tools to check for them so I do not have to look for these things myself. Another thing that nobody seems to have mentioned so far is that the HTML syntax allows uppercase tag and attribute names where most people prefer all lowercase nowadays, I think. Might be another thing to check.
I’m trying to move back to HTML4 from XHTML1.1 and my major problem was precisely letting go of nxml-mode. However, I found the old Emacs SGML/psgml modes are actually quite good, and with some coercion can duplicate everything I used nxml for.
As for the quotes, I’ll have to agree with the first poster. I think conforming to the specification is already too much stuff to keep in mind; I don’t want to deal with that, plus best-practices. If you guys like quotes, just put them in the spec; if not, make them entirely optional. I don’t like the state of optional-but-not-quite.