It’s just data

First Polyglot Validator Check Deployed

... to Validator.nu.  Look for the Profile option.

For now, the only checks made are on attribute values.  Attributes with unquoted values are flagged if either the Pedagogical or Polyglot profile is selected.  Additionally, attributes without values at all are flagged if Polyglot is selected.  To see this live, look for nowrap in the output for google.com.
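
To make the two attribute checks concrete, here is a minimal Python sketch.  It is not the validator's actual code: the function name check_start_tag, the lowercase profile strings, the message texts, and the simplified attribute regex are all assumptions made for illustration, and a real implementation would hook into the HTML tokenizer instead of using a regex.

```python
import re

# Hypothetical, simplified attribute pattern: a name, optionally followed by
# "=" and a double-quoted, single-quoted, or unquoted value.
# This is NOT a full HTML tokenizer.
ATTR = re.compile(
    r"""\s+(?P<name>[^\s"'<>/=]+)"""
    r"""(?:\s*=\s*(?P<value>"[^"]*"|'[^']*'|[^\s"'=<>]+))?"""
)

def check_start_tag(tag: str, profile: str) -> list[str]:
    """Return messages for one start tag such as '<td nowrap width=10>'."""
    messages = []
    # Skip "<" and the tag name, then scan the attributes that follow.
    start = re.match(r"<[A-Za-z][^\s/>]*", tag)
    if start is None:
        return messages
    for attr in ATTR.finditer(tag, start.end()):
        name, value = attr.group("name"), attr.group("value")
        if value is None:
            # Attribute without a value: flagged for Polyglot only.
            if profile == "polyglot":
                messages.append(f"attribute without value: {name}")
        elif value[0] not in "\"'":
            # Unquoted value: flagged for Pedagogical and Polyglot.
            if profile in ("pedagogical", "polyglot"):
                messages.append(f"unquoted attribute value: {name}")
    return messages
```

With this sketch, a tag like <td nowrap width=10> yields both messages under "polyglot" and only the unquoted-value message under "pedagogical".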

What the UI should look like and what checks should be made are all open for discussion.  Neither the coding nor the deployment is an issue; what is needed is somebody to step up and decide what profile(s) they feel are needed, and what checks should be made in each profile.  I find such discussions to be time sinks that I can ill afford at the present time.

If people want me to continue to contribute via coding, this input needs to be in the form of test cases.  HTML vs XHTML is an excellent source for ideas.  I intend to support any and all sincere efforts at defining a profile.


You most probably meant “nowrap” instead of “nowarn” ;-)

Posted by Thomas Broyer at

Fixed.  Thanks!

Posted by Sam Ruby at

Bug 644 has been fixed and deployed to qa-dev.w3.

Posted by Sam Ruby at

Quick thoughts:

1. Interesting: Looking at the interface I see “pedagogical” as an option. Is it described somewhere?

I volunteer to draft a paper for the W3C about this choice!

2. When I validate as “polyglot” or “pedagogical” my choice is not kept in the form after submission. My initial reaction was to think I did something wrong.

Posted by Lars Gunther at

Lars:

  1. I think you have this backwards.  I’ve created an empty slot.  You drafting a paper (ideally with, hint hint, test cases) is what will define this option.  :-)
  2. See the comment above yours: the bug has been reported, fixed, and deployed to the W3C “qa-dev” site.  The next time validator.nu is updated, this fix will be picked up.

Posted by Sam Ruby at

You will get testcases in a few days!

Posted by Lars Gunther at

Pedagogic validation of HTML

I have been trying to make HTML5 better for education by participating in the HTML5 effort at WHATWG for a few years. Recently also joined the W3C HTML5 Working Group. One of the things that might come out of this effort is an option in the HTML5...

Excerpt from Itpastorn's Thinkpad update & maintenance blog at

I guess Pedagogical would also be suitable for debugging purposes.

I think Pedagogical should include warnings for all omitted optional tags except for tbody and colgroup (maybe warn if just one of start and end is included); it should warn about any unquoted attribute values. I think nothing else should be added to this profile.

Polyglot needs to be complete in order to be useful for its intended purpose. I think the list is, in addition to Pedagogical:

- No minimized attributes
- Need xml:lang when lang is present.
- Need xmlns on root
- Need xmlns on all HTML elements whose parent is not an HTML element
- Need xmlns on svg and math elements
- Need xmlns:xlink in scope for any xlink:foo attributes
- Either: <script>/<style> does not contain any “<” or “&” or “]]>”, or: script/style is CDATA-section-escaped. For <script>, the first non-whitespace line needs to match (a):
/^\s*\/\/\s*<\!\[CDATA\[\s*$/
...and the last non-whitespace line
/^\s*\/\/\s*\]\]>\s*$/
or (b):
/^\s*\/\*\s*<\!\[CDATA\[\s*\*\/\s*$/
...and
/^\s*\/\*\s*\]\]>\s*\*\/\s*$/
or (c):
/^\s*<\!--\/\/--><\!\[CDATA\[\/\/><\!--\s*$/
...and
/^\s*\/\/--><\!\]\]>\s*$/
or (d):
/^\s*<\!--\/\*--><\!\[CDATA\[\/\*><\!--\*\/\s*$/
...and
/^\s*\/\*\]\]>\*\/-->\s*$/
where the string “]]>” does not appear in between.
For <style>, it’s the same as above but without (a) and (c).
- No form feed (including in NCR)
- The “DOCTYPE” part in the doctype needs to be all-uppercase
- The “SYSTEM” part of the compat doctype needs to be all-uppercase
- The “PUBLIC” part of the legacy doctypes needs to be in all-uppercase
- The HTML4 doctypes cannot be used (though they already emit a warning)
- void elements need the slash
- No “<” in RCDATA elements
- No unescaped ampersands
- No “]]>” in character data except in <script> and <style>. (It’s ok in attribute values and comments.)
- HTML elements and attribute names use all-lowercase
- SVG and MathML elements and attributes use the canonical case
- No entities except XML’s five predefined ones
- No line feed immediately following <pre> or <textarea> start tag (also as NCR)
- No <noscript>
- No content in <iframe>.
- Line feeds and tabs need to be written as NCRs in attribute values
- Need explicit <tbody> and </tbody> tags
- If HTTP didn’t specify encoding, need UTF-8 or UTF-16
- No sequences of bytes that are “misinterpreted for compatibility”
- No whitespace between <html> and <head>
- No whitespace after </body> or after </html>
- Text, comment data, and attribute values need to match XML 1.0 4ed Char. (Not sure if this is the case already for conforming documents, ignoring form feed.)
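
The <script>/<style> item in the list above is the most involved one, so here is a rough Python sketch of how it could be checked. The patterns are the (a) to (d) regexes from that item, rewritten in Python syntax; the function name is_polyglot_safe, the VARIANTS table, and the choice to work on the element's raw text content are assumptions for illustration, not part of any existing validator.

```python
import re

# (opening line, closing line) patterns for variants (a)-(d), transcribed
# from the list item above into Python regex syntax.
VARIANTS = {
    "a": (r"^\s*//\s*<!\[CDATA\[\s*$", r"^\s*//\s*\]\]>\s*$"),
    "b": (r"^\s*/\*\s*<!\[CDATA\[\s*\*/\s*$", r"^\s*/\*\s*\]\]>\s*\*/\s*$"),
    "c": (r"^\s*<!--//--><!\[CDATA\[//><!--\s*$", r"^\s*//--><!\]\]>\s*$"),
    "d": (r"^\s*<!--/\*--><!\[CDATA\[/\*><!--\*/\s*$", r"^\s*/\*\]\]>\*/-->\s*$"),
}

def is_polyglot_safe(content: str, element: str = "script") -> bool:
    """Hypothetical check of one <script> or <style> element's text content."""
    # Case 1: none of the troublesome sequences appear at all.
    if not any(s in content for s in ("<", "&", "]]>")):
        return True
    # Case 2: CDATA-section-escaped. <style> only gets variants (b) and (d).
    allowed = "abcd" if element == "script" else "bd"
    lines = [ln for ln in content.splitlines() if ln.strip()]
    if len(lines) < 2:
        return False
    first, last = lines[0], lines[-1]
    # "]]>" must not appear between the opening and closing lines.
    if "]]>" in "\n".join(lines[1:-1]):
        return False
    return any(
        re.match(VARIANTS[k][0], first) and re.match(VARIANTS[k][1], last)
        for k in allowed
    )
```

For example, a script whose first and last non-whitespace lines are "// <![CDATA[" and "// ]]>" passes as a script but not as a style sheet, since variant (a) relies on "//" comments.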

This still leaves differences in scripts and style sheets; if those are present, maybe emit a message saying that it wasn’t checked for polyglotness, with a link explaining the differences.

Posted by zcorpan at

Another useful resource is the Useful Validator Warnings requests page.  Though, it’s still a work in progress, like the HTML vs. XHTML page.

Posted by Lachlan Hunt at

This Week in HTML5 – Episode 35

Topics this week include keygen, dialog, and the pros and cons of including examples in specs....

Excerpt from The WHATWG Blog at
