The only checks made for now are on attribute values. Attributes with unquoted values are flagged if either the Pedagogical or Polyglot profile is selected. Additionally, attributes without any value at all are flagged if Polyglot is selected. To see this live, look for nowrap in the output for google.com.
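As a rough illustration of the two attribute checks described above, here is a minimal Python sketch. It is not the validator's actual code: the function name `check_attributes`, the lowercase profile strings, and the regex-based tag scanning are all assumptions made for the example, and the scanner does not skip comments or script content.

```python
import re

# Hypothetical, simplified tag scanner: a start tag is "<name ... >", where
# quoted attribute values may contain ">" characters.
START_TAG_RE = re.compile(r"""<([A-Za-z][^\s/>]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>""")
# One attribute: a name, optionally followed by "=" and a double-quoted,
# single-quoted, or unquoted value.
ATTR_RE = re.compile(r"""([^\s"'<>/=]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?""")


def check_attributes(html, profile):
    """Return (tag, attribute, problem) tuples for an assumed profile name."""
    findings = []
    for tag in START_TAG_RE.finditer(html):
        for attr in ATTR_RE.finditer(tag.group(2)):
            name, value = attr.group(1), attr.group(2)
            if value is None:
                # Attribute without a value: flagged for Polyglot only.
                if profile == "polyglot":
                    findings.append((tag.group(1), name, "no value"))
            elif value[0] not in "\"'":
                # Unquoted value: flagged for Pedagogical and Polyglot.
                if profile in ("pedagogical", "polyglot"):
                    findings.append((tag.group(1), name, "unquoted value"))
    return findings
```

For example, `check_attributes('<td nowrap width=10 class="x">', "polyglot")` reports both `nowrap` and `width`, while the same call with `"pedagogical"` reports only `width`.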
What the UI should look like and what checks should be made are all open for discussion. Neither the coding nor the deployment is an issue; what is needed is somebody to step up and decide which profile(s) they feel are needed, and what checks should be made in each profile. I find such discussions to be time sinks, ones that I can ill afford at the present time.
If people want me to continue to contribute via coding, this input needs to be in the form of test cases. HTML vs XHTML is an excellent source for ideas. I intend to support any and all sincere efforts at defining a profile.
You most probably meant “nowrap” instead of “nowarn” ;-)
I have been trying to make HTML5 better for education by participating in the HTML5 effort at WHATWG for a few years. Recently also joined the W3C HTML5 Working Group. One of the things that might come out of this effort is an option in the HTML5...
I guess Pedagogical would also be suitable for debugging purposes.
I think Pedagogical should include warnings for all optional tags except for tbody and colgroup (maybe warn if just one of the start and end tags is included); it should also warn about any unquoted attribute values. I think nothing else should be added to this profile.
Polyglot needs to be complete in order to be useful for its intended purpose. I think the list is, in addition to Pedagogical:
- No minimized attributes
- Need xml:lang when lang is present.
- Need xmlns on root
- Need xmlns on all HTML elements whose parent is not an HTML element
- Need xmlns on svg and math elements
- Need xmlns:xlink in scope for any xlink:foo attributes
- Either: <script>/<style> does not contain any “<” or “&” or “]]>”, or: script/style is CDATA-section-escaped. For <script>, the first non-whitespace line needs to match (a):
...and the last non-whitespace line
where the string “]]>” does not appear in between.
For <style>, it’s the same as above but without (a) and (c).
- No form feed (including in NCR)
- The “DOCTYPE” part in the doctype needs to be all-uppercase
- The “SYSTEM” part of the compat doctype needs to be all-uppercase
- The “PUBLIC” part of the legacy doctypes needs to be in all-uppercase
- The HTML4 doctypes cannot be used (though they already emit a warning)
- void elements need the slash
- No “<” in RCDATA elements
- No unescaped ampersands
- No “]]>” in character data except in <script> and <style>. (It’s ok in attribute values and comments.)
- HTML elements and attribute names use all-lowercase
- SVG and MathML elements and attributes use the canonical case
- No entities except XML’s five predefined ones
- No line feed immediately following <pre> or <textarea> start tag (also as NCR)
- No <noscript>
- No content in <iframe>.
- Line feeds and tabs need to be written as NCRs in attribute values
- Need explicit <tbody> and </tbody> tags
- If HTTP didn’t specify encoding, need UTF-8 or UTF-16
- No sequences of bytes that are “misinterpreted for compatibility”
- No whitespace between <html> and <head>
- No whitespace after </body> or after </html>
- Text, comment data, and attribute values need to match XML 1.0 4ed Char. (Not sure if this is the case already for conforming documents, ignoring form feed.)
This still leaves differences in scripts and style sheets; if those are present, maybe emit a message saying that it wasn’t checked for polyglotness, with a link explaining the differences.
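To show how a few items from the list above could be turned into test-case-style checks, here is a rough Python sketch. It covers only five of the rules (uppercase “DOCTYPE”, no form feed, slash on void elements, script/style content, whitespace after </html>), works on the raw source text with regular expressions rather than a real tokenizer, and does not verify the CDATA-section escaping itself. The function name `polyglot_findings` and the message strings are made up for the example.

```python
import re

VOID = "area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr"
# A void element start tag; quoted attribute values may contain ">".
VOID_TAG_RE = re.compile(
    r"""<(?:%s)(?=[\s/>])((?:"[^"]*"|'[^']*'|[^>"'])*)>""" % VOID, re.IGNORECASE)
# A <script> or <style> element and its raw text content.
RAW_TEXT_RE = re.compile(r"<(script|style)\b[^>]*>(.*?)</\1\s*>",
                         re.IGNORECASE | re.DOTALL)


def polyglot_findings(source):
    """Return messages for a handful of the Polyglot rules (hypothetical)."""
    findings = []

    # The "DOCTYPE" part of the doctype needs to be all-uppercase.
    doctype = re.search(r"<!doctype", source, re.IGNORECASE)
    if doctype and doctype.group(0) != "<!DOCTYPE":
        findings.append('"DOCTYPE" is not all-uppercase')

    # No form feed, either literal or as a numeric character reference.
    if "\f" in source or re.search(r"&#(?:0*12|[xX]0*[cC]);", source):
        findings.append("form feed present")

    # Void elements need the slash.
    for tag in VOID_TAG_RE.finditer(source):
        if not tag.group(1).rstrip().endswith("/"):
            findings.append("void element without slash: " + tag.group(0))

    # <script>/<style> containing "<", "&" or "]]>" must be CDATA-escaped.
    for element in RAW_TEXT_RE.finditer(source):
        body = element.group(2)
        if "<" in body or "&" in body or "]]>" in body:
            findings.append("<%s> needs CDATA-section escaping (not verified)"
                            % element.group(1).lower())

    # No whitespace after </html>.
    if re.search(r"</html\s*>\s", source, re.IGNORECASE):
        findings.append("whitespace after </html>")

    return findings
```

Each rule is a separate, independent check, so further items from the list could be added one at a time alongside a matching test document.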