Ian Hickson: If we truly want to make authors have better tools for making their content more compliant, a start would be having the W3C invest more genuinely in its validators. The W3C HTML Validator is one of the user agents that ignores the Content-Type header when it comes to HTML vs XHTML; filed as bug 1500 about a year ago, still unfixed.
Saying that the validator is somebody else’s problem only works if you don’t believe that the problem is important.
If people feel that HTML 5 deserves a better validator, I’m willing to invest some time into coding. Are there others interested in contributing to the coding, the writing up of test cases, or the authoring of documentation?
Another thing that will be needed down the road is somebody to host it. I host the Feed Validator, and that gets plenty of traffic as is; I can only imagine what kind of traffic an HTML validator would have.
I think the W3C Validator has a few other bugs that I continue to hit over the years. But I don’t think hosting should be that much of an issue. I’d be surprised if no one came forward. Perhaps the Web Standards Project would have some leads.
I’d be happy to assist with both test cases and documentation authoring. I still have test case files I built for IE5.5 way back when.
I’d love to write test cases as well as code for the validator itself, if it were open source and contributing were as easy as saying “pie”. I can’t help with the hosting, but I feel that the employers as well as W3C member companies of some of this blog’s readers (and even owner) perhaps could do something to help? :-)
Anyway, while speaking of content type and validators, why can’t we have one submission form (e.g. validation front page) for any kind of validation? If the validator can sniff what content type the resource at the end of a given URI has, it can then invoke the correct validator. application/xhtml+xml invokes the XHTML validator, application/atom+xml invokes the Atom validator, text/html invokes the HTML validator, text/css invokes the CSS validator and so on.
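The dispatch described above could be sketched roughly as follows. This is only an illustration: the `VALIDATORS` table and the function names are hypothetical, and it trusts the declared Content-Type header rather than inspecting the document body.

```python
# Hypothetical mapping from media type to the name of the validator to invoke.
VALIDATORS = {
    "application/xhtml+xml": "xhtml",
    "application/atom+xml": "atom",
    "text/html": "html",
    "text/css": "css",
}


def media_type(content_type):
    """Return the bare, lowercased media type from a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    return content_type.split(";", 1)[0].strip().lower()


def pick_validator(content_type):
    """Return the validator name for a Content-Type value, or None if unsupported."""
    return VALIDATORS.get(media_type(content_type))
```

With this table, `pick_validator("text/html; charset=utf-8")` selects the HTML validator and an unlisted type such as `image/png` yields `None`, which a front page could turn into an “unsupported type” message.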
Unfortunately, validator.org is already taken by a domain shark, but we might perhaps think of a domain name suitable for this task that isn’t reserved just yet.
I can help with any design aspects and XHTML (I’d love to know how to make a validator, but for now I’m still writing the XHTML). I’d love to take part in a project like this.
By all means, you guys continue to bikeshed. As for me, I’m waiting for an expression of interest by the HTML 5 community.
From my perspective:
A markup specification as lengthy as HTML 5 is effectively only a guide without a validator, as few will be able to grok it in its entirety. A validator can be an important part of a feedback loop which will cause users to report areas of the spec that they don’t fully understand or are prone to causing common usage errors. This feedback can be entirely automatic.
The Feed Validator started out life as an RSS 2.0 validator that also happens to be helpful for RSS 0.91 and RSS 0.92 feeds. To this day, it will report RSS 0.91 feeds which do not contain required item title elements to be valid, and will report RSS 0.92 feeds which contain neither item descriptions nor item titles as invalid. As HTML 5 is destined to be neither a fully compliant SGML grammar nor a fully compliant XML grammar, a parser specifically designed for HTML 5 is in order. The Feed Validator will also do an additional RDF/XML validity check for RSS 1.0 feeds; an HTML validator could do similarly for XHTML.
I haven’t looked closely at the existing validator, beyond determining that it appears to be legacy code. This coupled with the knowledge that few seem interested in maintaining it leads me to conclude that a new effort is warranted.
All in all, if there is interest and participation, I believe that a reasonably useful validator could be built this fall, and could be pretty much fully functional, stable, and maintained by a self-sustaining community by year end.
I was a bit amazed that WebValidator.org was available, so I’ve now registered it if we ever come to the step of putting up a new validator. The domain is currently hosted at DreamHost on a shared host, so I don’t think it will handle the pressure of such a service, but it can be hosted there until anyone else has something better to offer.
My upcoming master’s thesis is tentatively titled “A Conformance Checking Service for Web Applications 1.0 Documents”. That is, the goal is to write a conformance checker for HTML5 to the extent the spec is ready at the time I need to wrap up and graduate. (Of course, it would be nice to update the service later when HTML5 is done.)
My HTML5 conformance checker is a special case of my Validation Service for RELAX NG. However, I fully realize (and have realized from the outset) that there is no schema language that can fully describe the conformance requirements of HTML5. The plan is to express in RELAX NG everything that is convenient to express in RELAX NG. Of what is left, the plan is to use Schematron for everything for which Schematron is convenient. For the rest, the plan is to use a Turing-complete language, in my case Java. When it makes sense to glue RELAX NG and the Turing-complete language together by implementing a datatype library, the plan is to do so.
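To make the layering concrete, here is a toy sketch of the idea in Python. It is not the actual checker, which uses RELAX NG, Schematron, and Java. Each layer below is a plain-Python stand-in, and the element names, the rule, and the `date` attribute are invented for illustration and are not taken from HTML5.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Layer 1: stand-in for a grammar-based schema (the role RELAX NG plays).
# Maps each known element to the child elements it may contain. Toy data.
ALLOWED_CHILDREN = {
    "doc": {"section"},
    "section": {"para", "time"},
    "para": set(),
    "time": set(),
}


def check_grammar(root):
    errors = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            errors.append(f"unknown element <{parent.tag}>")
            continue
        for child in parent:
            # Unknown children are reported once, when visited as a parent.
            if child.tag in ALLOWED_CHILDREN and child.tag not in allowed:
                errors.append(f"<{child.tag}> not allowed in <{parent.tag}>")
    return errors


# Layer 2: stand-in for assertion-style rules (the role Schematron plays).
# Each rule is (context element, predicate, message). Hypothetical rule.
RULES = [
    ("section", lambda el: el.find("para") is not None,
     "<section> needs at least one <para>"),
]


def check_rules(root):
    return [msg for tag, test, msg in RULES
            for el in root.iter(tag) if not test(el)]


# Layer 3: general-purpose code for what neither layer expresses
# conveniently, here a datatype check on a hypothetical date attribute.
def check_datatypes(root):
    errors = []
    for el in root.iter("time"):
        value = el.get("date", "")
        try:
            date.fromisoformat(value)
        except ValueError:
            errors.append(f"bad date value {value!r}")
    return errors


LAYERS = [check_grammar, check_rules, check_datatypes]


def check_document(xml_text):
    """Run every layer in order and collect all reported errors."""
    root = ET.fromstring(xml_text)
    errors = []
    for layer in LAYERS:
        errors.extend(layer(root))
    return errors
```

Each layer only sees the parsed tree and returns a list of messages, so a rule can move from one layer to another without touching the driver.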
I didn’t write the schemas from scratch. The person in charge of the schema project is fantasai, who wrote the bulk of “HTML5 Core”. (Note that the modularization choices are not Hixie-endorsed.) I have contributed stuff outside the Core including Web Forms (both 1.0 and 2.0).
The status of the project hasn’t changed since early May, because I got a contract for working on Firefox. However, the contract runs out in a couple of weeks after which I intend to take the thesis work out of the freezer and continue it. It turns out that the WHAT WG work has focused on non-syntax matters over the summer, so the break happened to be well-scheduled.
The known bugs / unimplemented features as of May are documented.
I have used test cases by Anne van Kesteren and fantasai. I’d love to have more test cases.
The architecture of the software is well-suited for supporting HTML 4.01 and XHTML 1.x as well. I just haven’t gotten around to incorporating the HTML5 datatypes in those schemas or adding XHTML+MathML+SVG to the preset list (originally due to legal reasons but later due to being busy doing other things).
I have been unable to find an automated regression test suite in your source code. Did I miss it?
There’s no automated test suite for the front end. There are, however, automated tests for the back end. The code is in the schema CVS repository. (Test driver. Driver for setting up the driver.)
Note that a very large chunk of testing is based on Anne van Kesteren’s Web Forms 2.0 test suite as patched by me and those files aren’t in the CVS repo.
Why think of building a new HTML validator, when there are plenty in development already, and many of them open source? There’s Henri's, mentioned above, there’s the W3C’s, well in need of some love and help from the community it has served for all these years, there’s also relaxed, with some nice technical aspects as well, and quite a few others.
So, is there a need for a new HTML validator? I don’t think so. From my perspective, there is a need for a stronger belief in open source and working together, however rewarding the idea that “I can do better on my own” may be.
Sam: I’m not attributing motives to you, just pointing out that unless you really want to start from scratch, there are plenty of tools to which you could contribute. Whether they fit your criteria, taste, and desires is up to you.
The W3C Validator may have some bugs, but it carries a lot of weight for markup validation, and you may see many websites with very poor ranking if they have W3C validation errors. It’s always recommended to fix your markup according to the errors it displays.