It’s just data

Feedback on XHTML

Anne van Kesteren: I hope we can all agree that sending XHTML as application/xhtml+xml is silly

Ah, the sweet smell of flamebait.

It’s funny.  I first read Anne’s post on Planet Intertwingly, where it was XHTML served as application/xhtml+xml to my browser (Firefox).  Whether or not it is valid XHTML is debatable (in fact, it is not), but what is not debatable is that it was well formed XML.

If you like, we can debate whether or not The WHATWG Blog’s feed should include relative URIs.

But another thing that is not debatable is that with modern browsers, you won’t see any of Jacques Distler's silly MathML formulas, or any of my silly SVG icons unless we serve our content as application/xhtml+xml.

I will also note that the same reasons that text/xml should be deprecated apply equally well to text/html.

But here is something we can agree on: the requirement that pages must be well formed XHTML is too high a barrier to adoption for technologies such as MathML and SVG.
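The usual compromise at work here is content negotiation: send application/xhtml+xml only to browsers that explicitly advertise it in their Accept header, and text/html to everyone else. A minimal sketch in Python — the function name and the simplistic header parsing are mine, purely illustrative, not anyone’s production code:

```python
def pick_content_type(accept_header):
    # Parse the Accept header into bare media types, ignoring q-values.
    accepted = [part.split(";")[0].strip()
                for part in accept_header.split(",")]
    # Only user agents that explicitly advertise XHTML support get it;
    # everyone else (notably Internet Explorer) falls back to text/html.
    if "application/xhtml+xml" in accepted:
        return "application/xhtml+xml"
    return "text/html"

# Firefox explicitly advertises application/xhtml+xml:
print(pick_content_type(
    "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9"))
# → application/xhtml+xml

# IE 6 sends */*, which this deliberately does not treat as support:
print(pick_content_type("*/*"))
# → text/html
```

Note that a strict reading of HTTP would say `*/*` accepts everything; the pragmatic sniff-for-the-exact-string approach is the same one the .htaccess RewriteCond further down this thread uses.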


If you like, we can debate whether or not The WHATWG Blog’s feed should include relative URIs.

And out of the box, WordPress does the XHTML as text/html thing as well. (Yeah, there’s still some of that left.)

Posted by Henri Sivonen at

Bulletholes

Henri Sivonen bursts my bubble.... [more]

Trackback from Musings

at

I still wonder whether including SVG inline is really the right thing to do. It’s pretty fancy and the way you’re doing it, it even seems a bit semantic, but still. As for text/html and SVG. You always use a data URI. MathML and text/html is being worked upon.

Posted by Anne van Kesteren at

Also, the post is sort of intended as advice. When you know enough about the subject, as I think you do, you can safely ignore it. I believe that goes for any advice.

Posted by Anne van Kesteren at

And out of the box, WordPress does the XHTML as text/html thing as well

I would think that a WHATWG site (blog or otherwise) would want to be an exemplar of the technologies and formats it wants to endorse.  As applied to WordPress, I would presume that that would involve producing, making available, and advocating standards-conforming templates for others to use.

As for text/html and SVG. You [can] always use a data URI.

Right.  You criticize application/xhtml+xml because it is not “supported by Internet Explorer, Google, older user agents”, yet you advocate data URIs?

I personally find data URIs harder to author by hand in VIM than SVG.
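To make the comparison concrete, here is what the data-URI route actually involves — a sketch of my own, using a trivial made-up icon, showing both the base64 and the percent-encoded flavors:

```python
import base64
import urllib.parse

svg = ('<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
       '<circle cx="5" cy="5" r="4"/></svg>')

# Base64 flavor: completely opaque, and roughly 4/3 the size of the markup.
b64_uri = ("data:image/svg+xml;base64," +
           base64.b64encode(svg.encode("utf-8")).decode("ascii"))

# Percent-encoded flavor: marginally more legible, still no fun in VIM.
pct_uri = "data:image/svg+xml," + urllib.parse.quote(svg)

print(len(svg), len(b64_uri), len(pct_uri))
```

Either flavor has to be regenerated wholesale after every edit to the image, which is the hand-authoring objection in a nutshell.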

MathML and text/html is being worked upon.

SVG is the same basic problem.

the post is sort of intended as advice

The post is incomplete and misleading in a number of ways.  There isn’t only one validator.  No format is 100% forward compatible — ask anybody who recently upgraded to IE7.  No format is compatible with all devices.  And what exactly is an “XML tool chain”?

Just so I am not misinterpreted: I believe strongly in the direction that the WHATWG is heading.  I just don’t believe that this type of advocacy is helpful.

Posted by Sam Ruby at

I would think that a WHATWG site (blog or otherwise) would want to be an exemplar of the technologies and formats it wants to endorse.  As applied to WordPress, I would presume that that would involve producing, making available, and advocating standards-conforming templates for others to use.

Since getting WordPress to function in an application/xhtml+xml world is durned near impossible, I suspect that means supplying a set of HTML 4.01 templates.

MathML and text/html is being worked upon.

SVG is the same basic problem.

Much as I, selfishly, would like to see MathML in text/html (essentially, by adding Presentational MathML to HTML5), it’s clear that WHATWG is not going to repeat the same move for inline SVG.

There isn’t only one validator.

The XHTML validation services that I know of are

1. The W3C Validator.
2. Validome
3. The WDG Validator.

Each is broken in some important fashion. (The latter two will falsely claim my pages to be invalid. The first will label ill-formed XHTML content as valid.)

Posted by Jacques Distler at

Since getting WordPress to function in an application/xhtml+xml world is durned near impossible

At one time, I’m pretty sure that that could have been said about Planet.

it’s clear that WHATWG is not going to repeat the same move for inline SVG.

Citations?

The XHTML validation services that I know of are

Don’t forget Henri’s.  But, alas, it is not quite there yet either.

Posted by Sam Ruby at

it’s clear that WHATWG is not going to repeat the same move for inline SVG.

Citations?

I probably don’t need to point you to the extensive discussions revolving around whether XML namespace syntax will be supported in the HTML5 serialization (bottom line: no).

In the case of Presentational MathML, this is not a problem, as PMML element names are all distinct from existing or proposed HTML element names.

[The proposal is that these elements will be intercepted, and placed in the MathML namespace in the DOM. It is claimed that this can be done without explicit namespace support in the serialization.]

In the case of SVG, this is not possible, as there are element name collisions (title, script, a, ...), which require namespaces to disambiguate.
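As an illustration of the collision (mine, not from the discussion): in an XML parse, the namespace is exactly what keeps SVG’s title distinct from HTML’s. A quick sketch with Python’s ElementTree:

```python
import xml.etree.ElementTree as ET

# A toy XHTML page with an inline SVG island; both vocabularies
# define an element named "title".
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<title>the page</title></head><body>'
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<title>the picture</title></svg></body></html>')

# ElementTree exposes the namespace in the tag name (Clark notation),
# so the two titles are different element types, not a name clash.
html_title = doc.find('.//{http://www.w3.org/1999/xhtml}title')
svg_title = doc.find('.//{http://www.w3.org/2000/svg}title')
print(html_title.text, "/", svg_title.text)  # → the page / the picture
```

Without namespaces (as in the HTML5 serialization) the parser would need some other rule to decide which `title` it is looking at — hence the dispute.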

Don’t forget Henri’s.  But, alas, it is not quite there yet either.

Since I’m not serving (X)HTML5, Henri’s validator is not going to do me much good.

Posted by Jacques Distler at

it’s clear that WHATWG is not going to repeat the same move for inline SVG.

Citations?

I probably don’t need to point you to the extensive discussions revolving around whether XML namespace syntax will be supported in the HTML5 serialization (bottom line: no).

Actually, I’m blissfully unaware of the prior discussions.

In the case of Presentational MathML, this is not a problem, as PMML element names are all distinct from existing or proposed HTML element names.

[The proposal is that these elements will be intercepted, and placed in the MathML namespace in the DOM. It is claimed that this can be done without explicit namespace support in the serialization.]

In the case of SVG, this is not possible, as there are element name collisions (title, script, a, ...), which require namespaces to disambiguate.

The way the feedparser works is that it looks for an ancestor element of <math> and <svg> respectively.  I have slightly tighter support in place when the content is identified as xhtml and is well formed, but the fallback parser is based on SGML and isn’t aware of namespaces — though I do currently look for an xmlns attribute on the ancestral marker element and verify that it matches.  This means that while namespace prefixes won’t work, and one can’t simply inherit the default namespace from earlier in the stream, I do handle the normal case even in entity-escaped HTML in non-well-formed RSS: testcase
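The fallback idea Sam describes can be approximated like this — a sketch of the strategy only, not the actual feedparser code; the class and names are invented:

```python
from html.parser import HTMLParser

SVG_NS = "http://www.w3.org/2000/svg"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

class MarkerScanner(HTMLParser):
    """Hypothetical sketch: instead of real namespace processing, a
    tag-soup parser watches for an <svg> or <math> start tag and
    verifies that its xmlns attribute matches the expected namespace."""
    def __init__(self):
        super().__init__()
        self.islands = []
    def handle_starttag(self, tag, attrs):
        ns = dict(attrs).get("xmlns")
        if tag == "svg" and ns == SVG_NS:
            self.islands.append("svg")
        elif tag == "math" and ns == MATHML_NS:
            self.islands.append("math")

scanner = MarkerScanner()
scanner.feed('<p>see <svg xmlns="http://www.w3.org/2000/svg">'
             '<circle r="1"/></svg> and <math>unmarked</math></p>')
print(scanner.islands)  # → ['svg'] — the unmarked <math> is rejected
```

As the surrounding text notes, this handles the normal case but not namespace prefixes or a default namespace declared earlier in the stream.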

Posted by Sam Ruby at

I would think that a WHATWG site (blog or otherwise) would want to be an exemplar of the technologies and formats it wants to endorse.

The WHATWG faces the same problem everyone else faces: Proper tools are not available or are not available with a one-click install. Of the choices of not setting up a blog, writing code for a blogging system, installing something other than WordPress and installing WordPress with one click, the last option won. And just like everyone else discovers, once the system is running, there are fires to put out here and there.

The XHTML validation services that I know of are

Since I’m not serving (X)HTML5, Henri’s validator is not going to do me much good.

OK. I finally took the bait.

Relaxed has supported XHTML+SVG+MathML for some time now. I downloaded the schemas, merged the latest version of the MathML schema from upstream, ported exclusions to Schematron, removed target from Strict, disabled annotation-xml (the easiest way to get the schema working with RNG DTD compat enabled) and deployed. Compared to Relaxed, I also offer table integrity checking and some Charmod stuff.

This is a weekend hack: The new schema has undergone nearly zero quality assurance on my part. Anyway, there’s now a new preset. With Musings, the target attribute gets caught. A Transitional schema exists but is not offered as a preset. Finally, here is a configuration that silences all errors.

Full disclosure: I suspect my validation service may have an ill-formedness detection bug in the area of what characters exactly are allowed in XML element and attribute names. There’s a call to java.lang.Character.isLetter(), which smells like a Unicode versioning bug. I need to review the code at some point.

The way the feedparser works is that it looks for an ancestor element of <math> and <svg> respectively.

That’s the approach I’d like to see rather than the approach Hixie has suggested. Making the argument on the right fora is one of the many things I should get around to doing.

Posted by Henri Sivonen at

Hmm. Looks like it wasn’t just a problem with preview. :-(

Posted by Henri Sivonen at

XHTML 1.0 Strict, SVG 1.1, MathML 2.0

Do you really mean XHTML 1.0 Strict?

There are differences between that and XHTML 1.1. (Under the Modularization of XHTML, the actual XHTML 1.1 plus MathML 2.0 plus SVG 1.1 DTD includes the Target module.)

Otherwise, pretty good for a weekend’s work... :-)

The WHATWG faces the same problem everyone else faces: Proper tools are not available or are not available with a one-click install. Of the choices of not setting up a blog, writing code for a blogging system, installing something other than WordPress and installing WordPress with one click, the last option won.

There’s an alternative, somewhere between writing a new blogging system from scratch and meekly accepting what’s currently available. Namely, taking an existing blogging system and fixing it.

I don’t know that WordPress is the best starting point but, at least, you’d be free to distribute your  modified system.

Posted by Jacques Distler at

The way the feedparser works is that it looks for an ancestor element of <math> and <svg> respectively.

That’s the approach I’d like to see rather than the approach Hixie has suggested. Making the argument on the right fora is one of the many things I should get around to doing.

If somebody can point me in the right direction, I would be glad to help.

Hmm. Looks like it wasn’t just a problem with preview.

I’ve both corrected the posts, and found and fixed the bugs which caused the problem.  If it reoccurs, I am committed to doing the same again.

There’s an alternative, somewhere between writing a new blogging system from scratch and meekly accepting what’s currently available. Namely, taking an existing blogging system and fixing it.

+1

I guess the real question is whether or not the members of the WHATWG have any active interest in fixing this.

My problem is people who preach, but can’t be bothered themselves.  Operationally, the question isn’t whether or not there are temporary errors, but whether or not the errors are, in fact, only temporary.

Again, I’d be glad to help.

Posted by Sam Ruby at

In fairness to the WHATWG, the W3C QA blog is also published as faux XHTML (generated by MovableType). In their case, the hard work has been done. Because of licensing issues, however, the prospects for there ever being one-click installation are remote, indeed.

Posted by Jacques Distler at

But here is something we can agree on: the requirement that pages must be well formed XHTML is too high of a barrier for adoption for technologies such as MathML and SVG.

Authors capable of creating, and/or tools capable of generating, markup as complex as MathML and SVG can’t generate WF XML?

I understand, or I think I understand, why WF XML is a bar that’s arguably too high for HTML content created by users with no inherent interest in markup, but for MathML and SVG applications too?

Posted by Norman Walsh at

Do you really mean XHTML 1.0 Strict?

I think I did. I am not sure if it is a good idea, though. :-)

There are differences between that and XHTML 1.1. (Under the Modularization of XHTML, the actual XHTML 1.1 plus MathML 2.0 plus SVG 1.1 DTD includes the Target module.)

I made a quick and dirty draft that adds ruby and target but doesn’t remove the attributes that should be removed.

There’s an alternative, somewhere between writing a new blogging system from scratch and meekly accepting what’s currently available. Namely, taking an existing blogging system and fixing it.

Lachlan Hunt has been fixing the WHATWG WordPress installation.

I don’t know that WordPress is the best starting point but, at least, you’d be free to distribute your  modified system.

I think that in the long term, in general (beyond just blog.whatwg.org), it would be a better idea to invest in writing a new system instead of trying to fix WordPress. But I have other software to write right now.

That’s the approach I’d like to see rather than the approach Hixie has suggested. Making the argument on the right fora is one of the many things I should get around to doing.

If somebody can point me in the right direction, I would be glad to help.

This has been discussed on the WHATWG mailing list and in the mozilla.dev.tech.mathml  newsgroup.

I’ve both corrected the posts, and found and fixed the bugs which caused the problem.

Thanks. It looks like part of the bug remains.

My problem is people who preach, but can’t be bothered themselves.

I try to practice what I preach in the referenced document in the software that I write—even in quick hacks. It would be nice if the WordPress developers also practiced what I preach. :-)

(Note that above I recounted how the WHATWG blog ended up running WordPress. I was not suggesting that the errors in the output are good.)

Posted by Henri Sivonen at

I still wonder whether including SVG inline is really the right thing to do. […] You always use a data URI.

Intuitively, merging XML trees makes more sense than XML as Base64 in XML.

Posted by Henri Sivonen at

Authors capable of creating, and/or tools capable of generating, markup as complex as MathML and SVG can’t generate WF XML?

I understand, or I think I understand, why WF XML is a bar that’s arguably too high for HTML content created by users with no inherent interest in markup, but for MathML and SVG applications too?

What makes you think that authors interested in creating MathML content have the slightest “inherent interest in markup,” know or care what the phrase “well-formed” means, or have the means or desire to create tools capable of generating well-formed XHTML?

For the most part, they don’t, nor should they. They ought to be able to install an off-the-shelf blogging system, download a tool, and be set to go.

Posted by Jacques Distler at

I understand, or I think I understand, why WF XML is a bar that’s arguably too high for HTML content created by users with no inherent interest in markup, but for MathML and SVG applications too?

First, do you consider VIM an SVG application?  That’s how I authored the SVG images you see on my weblog.

The world in which I would like to live is one where my daughter could view one of the ever-growing number of SVG images that are hosted on wikipedia, and could select-copy-paste one onto her MySpace template.  I would have no problem with the requirement that the SVG image itself be locally well formed, but the requirement that the enclosing page must also be well formed and served with a non-forgiving and not-universally supported MIME type is a non-starter.
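The “locally well formed” requirement Sam says he would accept is cheap to check — a sketch with a helper of my own invention, not anything Wikipedia or MySpace actually does:

```python
import xml.etree.ElementTree as ET

def is_locally_well_formed(fragment):
    """Check that an embedded island (e.g. a pasted SVG image) is
    well-formed XML on its own, regardless of the enclosing page."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

pasted = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="4"/></svg>'
print(is_locally_well_formed(pasted))            # → True
print(is_locally_well_formed("<svg><circle>"))   # → False: unclosed tags
```

The point being that a consumer can enforce well-formedness on the island alone, without demanding anything of the tag-soup page around it.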

Posted by Sam Ruby at

Regarding the WHATWG’s WordPress blog installation and the problems with it, I have attempted to get it to output HTML5 instead of XHTML, but the major problem is that there’s XHTML empty-element syntax, string-based processing and other nasty surprises littered throughout the code, so that it’s not just as simple as writing a new template for it.

It’s such a fundamental flaw in the way WordPress and many other CMSs have been built.  The best option is to write a CMS from scratch that uses real XML tools on the back end and serialises as HTML to output; but the amount of time and effort that would take to set up, compared with the one-click installation of WordPress, is just too much.

Although, I am working on such a CMS in my spare time and when it’s finished, I’ll be using it on my site and probably the WHATWG blog as well, but it’s going to take a while.

Posted by Lachlan Hunt at

The best option is to write a CMS from scratch that uses real XML tools on the back end and serialises as HTML to output

I don’t believe that a rewrite is required.

All that is required is the attention of some people who are interested in the problem of taking a stream of HTML5 bytes and using them to construct a DOM.  A real DOM.  One that can be processed using real XML tools.  Either that, or simply serialized into one of several recognizable serialization formats.

Does anybody here know of anybody who might be interested in that problem?

Posted by Sam Ruby at

First, do you consider VIM an SVG application?  That’s how I authored the SVG images you see on my weblog.

Most people just choose “Save As...” in a WYSIWYG tool.

I would have no problem with the requirement that the SVG image itself be locally well formed, but the requirement that the enclosing page must also be well formed and served with a non-forgiving and not-universally supported MIME type is a non-starter.

You’re describing XML Data Islands, which is basically how the MathPlayer Plugin supports MathML in IE/6. It handles embedded (well-formed) <math>...</math> fragments in an otherwise tag soup document.

The Design Science guys decided to demand that the document be served as application/xhtml+xml. But I think that was mainly to piggyback on Gecko’s behaviour, so that they could rely on the MathML being well-formed.

Posted by Jacques Distler at

I don’t know of anyone interested in making WordPress tackle that problem. I also don’t believe in “real XML tools” or “XML toolchains”. It’s easy to find systems that claim to be XML Toolchains, but I don’t think they have much in common with each other. Furthermore, none of them are as widely deployed as Movable Type or WordPress, so they don’t even have a user-hosted feasibility proof.

Systems that process syndication formats are a good example of HTML4/5 DOMs intertwingling with XML DOMs. Usually, they are ad-hoc object models, with boundaries denoted by a quoting mechanism like CDATA or string literals that don’t share a lexical environment with surrounding imperative code.

It’s not impossible to mix the two more fluidly. Try making an Atom feed with some type=html summaries and some type=xhtml summaries. View the feed preview in Firefox 2. Select all, and right-click “View Selection Source”.

Posted by Robert Sayre at

Yeah, same as with your nineteen favorite different feed formats. We stringify pieces of markup together and accept content from third parties without doing extensive character encoding checking. Somehow I got all that to work. Even in “XML.” Of course, I relied on a few things which I later discovered were browser bugs. Apparently you can’t really have “invalid characters” in XML.

Besides the lack of adoption by the number one user agent, the variety of character encodings in use on the web (for lots of resources you even have to do extensive sniffing to determine the encoding in the first place, and then you might be wrong) is probably a big problem with adopting XML.

I wonder how much harder (or easier) it would be to implement a version of XML with error handling. So that, just as it is with HTML, every byte stream can be converted into an XML document.
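One ingredient of such a forgiving XML would be guaranteeing that the character level never fails — a sketch of my own (real error handling would also have to cover markup errors, not just characters):

```python
import re

# Characters permitted in XML 1.0 text content: tab, LF, CR, then the
# ranges excluding other control characters, surrogates and U+FFFE/FFFF.
XML_ILLEGAL = re.compile(
    '[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]')

def bytes_to_xml_text(raw):
    """Decode lossily, then replace every code point XML 1.0 forbids,
    so that any byte stream yields legal XML character data."""
    text = raw.decode("utf-8", errors="replace")
    return XML_ILLEGAL.sub("\ufffd", text)

# NUL and form feed are illegal in XML; 0xFF is not valid UTF-8.
print(bytes_to_xml_text(b"ok \x00 \x0c bad \xff bytes"))
```

Every byte stream comes out as XML-safe text; what the HTML5 parsing algorithm adds on top of this is the equivalent guarantee for the markup structure itself.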

Posted by Anne van Kesteren at

So my previous comment was in direct reply to the comment from Norman Walsh. With all the comments that appeared while mine was sitting in preview, it now lacks some context.

Posted by Anne van Kesteren at

Jacques, you said: “In fairness to the WhatWG, the W3C QA blog is also published as faux XHTML (generated by MovableType). In their case, the hard work has been done.”

What is “faux XHTML”? That is the first time I have read this expression. The QA Weblog is served with text/html and the document type is XHTML 1.0 Strict.  Just to have a precise discussion, it is always better to specify the version number of XHTML.

To the WHATWG’s credit, I find that creating an HTML parsing model is a good thing. On the other hand, I would prefer that this parsing model be used as a way to help tools recover and fix documents, rather than presented as the norm, because that by itself doesn’t improve anything. I also think HTML matters much more for the producer class of products (authoring tools, libraries) than for the consumer class of products (browsers, bots), though in fact there are parts of it on both sides. Browsers have a lot more impact on the CSS side of a document.

Anne’s article about the feedback on HTML, and the associated comments that I have seen here and there on the Web, show that many people want XHTML and no longer want to revert to HTML. That is another reality. Living in one community only makes the rest of the picture blurry and dark. We then have a tendency to ignore that some other people have different needs, because we do not see them (Plato’s cave metaphor). It is not “victory”, it is just the beginning of working together. And those last two words have, for me, more meaning than anything else.

Posted by Karl Dubost, W3C at

Sam, for discussion about MathML and namespaces, see WebApps 1.0.

Posted by Karl Dubost, W3C at

What is “faux XHTML”?

I mean XHTML served as text/html, and parsed as HTML by the User-Agent. There’s nothing particularly bad about “faux XHTML,” except that (despite Appendix C) it is, in practice, non-interoperable with “real XHTML” (XHTML, served as application/xhtml+xml, and parsed as XML by the User-Agent).

Because of this, some people (Anne, one of the more vocal among them) say, “Why even bother with 'faux XHTML'? Send HTML4 instead.” Certainly, in the XML-based future, envisioned by the W3C, “faux XHTML” was supposed to be a transitional format. New content was supposed to be “real XHTML.”

This vision has proven illusory, and one of the main reasons why is illustrated quite beautifully by your blog’s software.

Posted by Jacques Distler at

I wrote:

Certainly, in the XML-based future, envisioned by the W3C, “faux XHTML” was supposed to be a transitional format. New content was supposed to be “real XHTML.”

Before anyone jumps on me, note that I meant “transitional,” both in the sense of old content and in the sense of old User-Agents (via content-negotiation).

Posted by Jacques Distler at

Not illusory: it is possible to serve content as application/xhtml+xml, and only as application/xhtml+xml. I do it on one web site. It might be difficult for some user agents.

It is already supported by many user agents; the picture is not as bad as people want to claim. It’s always better to have practical data with tests. There are tests for XHTML entities too, and the results.

Posted by Karl Dubost, W3C at

It might be difficult for some user agents.

“Despite the best that has been done by everyone . . . the war situation has developed not necessarily to our advantage.”

Emperor Hirohito
Radio Broadcast Announcing Japan’s Surrender
August 15, 1945

(apologies to Billmon)

Posted by Mark at

I am well-aware of the status of User-Agent support for XHTML. I was not talking about User-Agent support.

To see what I am talking about, might I suggest adding the following to the .htaccess file of your /QA directory:

Options FollowSymLinks
RewriteEngine On
RewriteBase /QA/
RewriteRule ^(.*)/(.*)/$ $1/$2/index.html
RewriteRule ^$ index.php
RewriteCond %{HTTP_ACCEPT} application\/xhtml\+xml
RewriteRule \.html$|\.php$   - [T=application/xhtml+xml]
RewriteRule mt.cgi|mt-comments.cgi - [E=HTTP_CONTENT_TYPE:application/xhtml+xml]

and making the following one-line addition (either use patch, or manually add the line marked by the “+”) to QA/sununga/lib/MT/App.pm.

--- QA/sununga/lib/MT/App.pm.orig  2006-06-22 06:07:01.000000000 -0500
+++ QA/sununga/lib/MT/App.pm       2006-06-25 00:13:15.000000000 -0500
@@ -101,6 +101,7 @@
 sub send_http_header {
     my $app = shift;
     my($type) = @_;
+    if ($ENV{'HTTP_CONTENT_TYPE'}){$type= $ENV{'HTTP_CONTENT_TYPE'};}
     $type ||= 'text/html';
     if (my $charset = $app->charset) {
         $type .= "; charset=$charset"

There! That was easy.

Now your “XHTML 1.0 Strict” blog is served with the proper MIME-type to compatible User-Agents, and as text/html to Internet Explorer.

Posted by Jacques Distler at

“What difference does it make to the dead, the orphans and the homeless, whether the mad destruction is wrought under the name of totalitarianism or the holy name of liberty or democracy?” - Mahatma Gandhi.

It is always interesting to see that some people always choose the conflict.

Jacques: Thanks! I’m aware of this solution. No need to modify the source code of MT; we do not use it for delivering the content, just to manage the content. All W3C files are under CVS, so everything you receive from the QA Weblog is a static file, except for the home page, which has a semi-dynamic content include coming from the Mailing-Lists.

Posted by Karl Dubost, W3C at

I don’t believe that a rewrite is required. All that is required is the attention of some people who are interested in the problem of taking a stream of HTML5 bytes and using them to construct a DOM.  A real DOM.  One that can be processed using real XML tools.  Either that, or simply serialized into one of several recognizable serialization formats.

I take it that you mean implementing the HTML5 parsing algorithm and at least DOM Level 2 Core in PHP4, right? Would this component sit in front of the usual WordPress output code, parse the soup and reserialize it? (It might make more sense to implement XOM in PHP than to implement the DOM with its design issues.)

Posted by Henri Sivonen at

I take it that you mean implementing the HTML5 parsing algorithm and at least DOM Level 2 Core in PHP4, right? Would this component sit in front of the usual WordPress output code, parse the soup and reserialize it?

What I would most like to see is something like this.

Such a function would take in a string and produce a string.  The guarantee is that the output string would produce the same DOM as the input string.  Ideally, users could select from a number of different serialization formats.

This could be applied globally to the template as you suggest, or could simply be made available as a function that people could use in combination with functions like the_content.

If such a function were implemented in C, it could easily be wrapped by a number of languages.
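A minimal sketch of the shape of such a function — emphatically not HTML5-compliant (it uses the stdlib html.parser, and the class name is invented), just to show the string-in, string-out contract and the cheap idempotence check that approximates “same DOM out as in”:

```python
from html.parser import HTMLParser

class Reserializer(HTMLParser):
    """Parse tag soup and emit normalized markup that a parser would
    turn into the same tree. A real version would implement the full
    HTML5 parsing algorithm instead of this toy."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
    def handle_starttag(self, tag, attrs):
        # Re-emit every attribute quoted, even if the input left it bare.
        a = "".join(' %s="%s"' % (k, (v or "").replace('"', "&quot;"))
                    for k, v in attrs)
        self.out.append("<%s%s>" % (tag, a))
    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.out.append("</%s>" % tag)
    def handle_data(self, data):
        # Re-escape text content consistently.
        self.out.append(data.replace("&", "&amp;").replace("<", "&lt;"))

def normalize(html):
    p = Reserializer()
    p.feed(html)
    p.close()
    return "".join(p.out)

# Idempotence is the cheap proxy for the same-DOM guarantee:
s = normalize('<p class=intro>a &amp; b<br></p>')
assert normalize(s) == s
print(s)  # → <p class="intro">a &amp; b<br></p>
```

Applied as a final output filter, a function of this shape is what would let WordPress keep its string-based internals while still emitting a predictable serialization.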

Posted by Sam Ruby at

Such a function would take in a string and produce a string.  The guarantee is that the output string would produce the same DOM as the input string.  Ideally, users could select from a number of different serialization formats.

Note that the HTML5 parsing algorithm does not guarantee a conforming DOM, so making the output conforming would require further work which would be dataloss from the point of view of the HTML5 parsing algorithm.

This could be applied globally to the template as you suggest, or could simply be made available as a function that people could use in combination with functions like the_content.

The HTML5 parsing algorithm operates on the entire document. Performing parts of it piecewise does not guarantee the right result. Moreover, callers of the_content cannot be trusted to get things right. Therefore, I think the best approach would be using PHP’s output buffering to catch the entire document from WordPress for sanitizing it as the last thing that happens before the script terminates.

If such a function were implemented in C, it could easily be wrapped by a number of languages.

The problem with C-based PHP extensions is that the people who have the access and time to install C-based extensions are the ones who could as well upgrade to e.g. Python. The sanitizer would need to be written in pure PHP4 in order to be usable wherever WordPress is.

Posted by Henri Sivonen at

No need to modify the source code of MT; we do not use it for delivering the content, just to manage the content.

Yes, I noticed that you eliminated the “Comment-Preview” page, so no need to deliver it as application/xhtml+xml. I thought you still might want to deliver the MT Admin interface as real XHTML, but perhaps not ... ;-)

All W3C files are under CVS

I fail to see the significance of this remark.

Anyway, I think you and your colleagues will find the above exercise an illuminating one. Let me know how it works out ...

Posted by Jacques Distler at

The problem with C-based PHP extensions is that the people who have the access and time to install C-based extensions are the ones who could as well upgrade to e.g. Python.

I don’t understand what Python has to do with this discussion, but in any case, I presume that somebody has access to the blog.whatwg.org site.

Instead of trying to boil the ocean, if we can momentarily discuss getting that one site’s HTML and feed(s) to be valid, I will assert that we need not wait for the fruition of an effort the size of a microkernel-based rewrite of the operating system for this to occur.

One could start with the existing templates and fix the parts of them that are universally broken.  I see evidence that this work was started but never completed.  At which point, access to the site could be limited to people who can be trusted to reliably produce content that was valid relative to the format selected.  One could even conceivably augment this with tools like validators that provide feedback, one could even augment it with tools that correct common errors and/or automatically converts from one format to another; but strictly speaking neither are precisely required.

The alternative is to decide that standards conformance is not important.  Many take such a position.  My only argument is that I believe that such a position is self-defeating when adopted by people who actively contribute to the standard being proposed.

Posted by Sam Ruby at

Sam Ruby: Feedback on XHTML

[link]...

Excerpt from del.icio.us/lachlan.hunt/whatwg at

The HTML5 parsing algorithm operates on the entire document.

It’s easy enough to wrap the_content in a minimal (X)HTML5 document and apply the parsing algorithm to that. In 90% of cases, re-serializing the sub-tree of <body> would produce the desired effect.

Moreover, callers of the_content cannot be trusted to get things right.

No, but fixing that problem is a one-time investment. After that, you only need to fix the_content.

I think Sam is lobbying for y’all to make that one-time investment. Lachlan is on the case. But, because of WordPress’s architecture, it’s a much bigger job than it ought to be.

As much as I love to complain about MovableType, it does, at least, feature a (nearly) clean separation of code and markup. Editing a few templates is infinitely easier than wading through thousands of lines of spaghetti code.

That still leaves the matter of fixing  the_content for another day ...

Posted by Jacques Distler at

Instead of trying to boil the ocean, if we can momentarily discuss getting that one site’s HTML and feed(s) to be valid

OK. But then the DOM stuff is probably out of the scope of the discussion.

One could start with the existing templates and fix the parts of them that are universally broken.  I see evidence that this work was started but never completed.

Or, alternatively, a catch-all fixing function could be used as the output handler. Probably easier.

At which point, access to the site could be limited to people who can be trusted to reliably produce content that was valid relative to the format selected.

Part of the point of running WP is having comments and pings enabled.

One could conceivably augment this with tools like validators that provide feedback, or with tools that correct common errors and/or automatically convert from one format to another; but strictly speaking, neither is required.

The main problems so far identified include:

Oh, and this is just WP—not MediaWiki.

Posted by Henri Sivonen at

No, but fixing that problem is a one-time investment.

Presuming that this can either be structured as a plugin, or that you can get the WordPress folks to accept a patch.

I think Sam is lobbying for y’all to make that one-time investment.

Primarily because, as you said, such an effort would be quite illuminating.

One potential outcome of such an activity is for the WHATWG to decide that, for HTML5, encountering U+002F SOLIDUS in states such as the Tag name state and Before attribute name state is not a parse error.
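For illustration, here is a sketch using the html5lib package (an assumption; it implements the HTML5 parsing algorithm) showing that the trailing solidus already has no effect on the resulting tree, which is part of the case for not flagging it as a parse error:

```python
import html5lib  # stand-in for an HTML5-conformant parser (assumed available)

def parse_and_serialize(markup):
    # Serialize the parsed fragment so the resulting trees can be
    # compared as strings.
    return html5lib.serialize(html5lib.parseFragment(markup),
                              omit_optional_tags=False)

# The tokenizer ignores the trailing solidus, so the XHTML-style
# spelling produces exactly the same tree as the HTML spelling:
assert parse_and_serialize("one<br/>two") == parse_and_serialize("one<br>two")
```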

Posted by Sam Ruby at

Ah, that thing. It turns out that GtkMozEmbed is very active these days. I think a big branch landed today. I also took an action to unify our sanitizers and serializers for 1.9. Another Mozilla developer is working on a scriptable HTML parser for Firefox 3 (DOMParser for HTML).

Posted by Robert Sayre at

FWIW, I have converted the default WordPress theme to HTML, including modifying all the PHP code to output HTML instead of XHTML. I can make it available if anyone wants it. Link is below:

Posted by Nicholas Shanks at

Jacques: about enlightening, it is not very new… [link]

As for CVS, my point is that MT is not installed on the same machine: files are generated in another workspace and then committed via CVS to the right machine. For what it's worth, the .htaccess is not enough, because it assumes that all files under the directory share the same doctype, which is not the case. But I will pursue a solution. I first have to identify the information map in terms of URIs and their qualifications. Interesting exercise.

1. identify the map of URIs
2. identify their doctype and associated mime-type
3. send them appropriately.

(Note: a file is not the same thing as the name at the end of a URI; the two are orthogonal.)

Posted by Karl Dubost, W3C at

Jacques: about enlightening, it is not very new… [link]

That wasn’t what I thought you would find enlightening.

As for CVS, my point is that MT is not installed on the same machine: files are generated in another workspace and then committed via CVS to the right machine. For what it's worth, the .htaccess is not enough, because it assumes that all files under the directory share the same doctype, which is not the case.

1. mod_rewrite does not map files, it maps URLs. So it doesn’t matter where the files physically reside, how they got there, nor any other details of how URLs are mapped to filenames.
2. The instructions I gave assumed that all .html and .php documents beneath http://w3.org/QA/ are XHTML documents, which you would like to serve with the correct MIME-type to compatible browsers. If that assumption is incorrect, you will need to adjust the rewriting rules appropriately.
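A sketch of rewriting rules along those lines (an illustration only, not the actual rules on the QA server; the path pattern and conditions are assumptions) might look like this in .htaccess:

```apache
# Serve .html files as application/xhtml+xml only to browsers whose
# Accept header advertises support for it; everyone else gets the
# default text/html. Adjust the pattern if not all files are XHTML.
RewriteEngine On
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0
RewriteRule \.html$ - [T=application/xhtml+xml]
```

Because mod_rewrite operates on URLs rather than on the filesystem, the same rules apply regardless of how the files arrived on the server.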

Posted by Jacques Distler at

Jacques: Hehe. I know all of that. ;) I suspect comments fail to convey tone in both directions. :)
1. Yes, I know :) we use it for many things already.
2. Yes, that's why I talked about mapping the information structure first. An interface for managing the URI space is one of the things missing from many Web site management tools. The only one I know of is Jigsaw, which really does look at a Web site in terms of resources. There may be others, and I would be very happy to learn about them.

By “illuminating,” do you mean that each time a page becomes invalid for one reason or another, browsers no longer display it (when served with the correct MIME type, application/xhtml+xml)? If so, yes. I have found it very practical in many cases as a step toward control of the page; I use it on another site. It also illustrates that much software does not apply controls, before publishing, on the type of content being published. There is room for improvement here almost everywhere. That's why I'm always very interested in any mechanism that helps the user and the software detect and improve the content.

Sam says it above: his daughter should just be able to cut and paste, without going through geek sorcery. It's also why I think that any effort to create a format must include the producer class of products for that format (authoring tools, libraries, etc.). Rendering is cool, but it comes somehow too late.

And I repeat it in case it was not clear: Defining an HTML Parsing model would be a very good thing. It would help on many fronts, not only browsers, but also repairing tools.

Posted by Karl Dubost, W3C at

One could start with the existing templates and fix the parts of them that are universally broken.  I see evidence that this work was started but never completed.

Sam, I’ve fixed all the mistakes in the WHATWG’s WordPress templates.  The major remaining issue is that XHTML empty-element syntax (“/>”) is littered throughout the core of the code (a quick search revealed 947 instances!), but I’m reluctant to go through and edit those files because those changes will just get overwritten at the next upgrade.

Thus, they still get inserted whenever anyone enters a single line break in a comment (which gets converted to <br />), and there’s still one <link /> element remaining because it’s not in the template like the others, but rather inserted by the wp_head() function called from header.php.

We have now fixed the feeds to output valid Atom 1.0 instead of RSS, thanks to a plugin we found, so we are making progress.  We just need to find or make a plugin that fixes these and the other entity-reference and character-encoding issues that Henri mentioned in a previous comment.

Posted by Lachlan Hunt at

Lachlan: excellent!

I’m going to try to make the case, in a separate post, that HTML5 should allow for the empty element syntax.

Posted by Sam Ruby at

Jacques: Hehe. I know all of that. ;) I suspect comments fail to convey tone in both directions. :)

It’s true that comments are too low-bandwidth to convey many of the subtleties and nuances of human conversation.

So I’m glad we understand each other ...

yes it’s why I talked about mapping the information structure first.

I took a 5-minute tour around your site, and jotted down the URL structure that I saw. Doubtless, the actual URL structure is a bit more complicated than what I saw in my brief tour, so the corresponding rewriting rules will surely be more complicated.

It also illustrates that much software does not apply controls, before publishing, on the type of content being published.

Particularly important if the “wrong” content means a fatal parsing error on the receiving end.

Have fun converting the QA blog to “real” XHTML.

Inspired by this conversation, I decided to release MTValidate 0.4, which is what I use to exert the kind of control you are talking about. As with previous versions, it uses a local copy of the W3C Validator. But it also uses XML::LibXML (libxml2) to ensure well-formedness.

And I repeat it in case it was not clear: Defining an HTML Parsing model would be a very good thing. It would help on many fronts, not only browsers, but also repairing tools.

It will vastly simplify the life of browser developers. Rather than having to waste their time reverse-engineering the error-correction algorithms of their competitors, they will be able to devote their energies to more productive efforts.

And, yes, a library based on that parsing model will permit the deployment of error-correction filters, along the lines Sam and Henri were talking about. But such tools already exist (Sam uses one in Venus). When it comes to error-correction, there’s an inevitable sense that one can do a better job of recovering the author’s intention with “a little tweak here and a little tweak there.” So I’m not sure that an error-correcting parser based on the HTML5 parsing model would “win” in the marketplace.

Posted by Jacques Distler at

WHAT Working Group boils Ocean!

Did you ever read Feedback on XHTML [November 25, 2006] on Intertwingly? It’s very heady. It was a reply to the statement “I hope we can all agree that sending XHTML as application/xhtml+xml is silly” by Anne van Kesteren [November 24, 2006]....

Excerpt from The Elementary Group Standards at

The Beginning of the End

HTML-safe generation of embedded MathML....

Excerpt from Musings at

I posted a bit of this thread to the WP-Hackers mailing list, and got this response from Alex Günsche:

“It is true that there are XHTML snippets hardcoded in the WP core. But it would be a matter of a couple of hours to find and replace that code. (This would also be required on each upgrade.) Then you would need time to create the template, which can range from a couple of hours, when taking and adapting a simple existing theme, up to a week or more when creating a new and sophisticated theme.”

Posted by Lawrence Krubner at

Version 5 of HTML is coming soon

I am struggling to accept what Anne van Kesteren announces as the future of the web: People are slowly starting to realize what HTML5 means and start arguing about individual things: * No SGML * No DOCTYPE * No formal schema * No versioning *...

Excerpt from Closer To The Ideal at

Add your comment