intertwingly

It’s just data

The URL Mess


tl;dr: shipping is a feature; getting the URL feature well-defined should not block HTML5 given the nature of the HTML5 reference to the URL spec.

This is a subject desperately in need of an elevator pitch.  From my perspective, here are the top three things that need to be understood:

1) From an HTML5 specification point of view, there is no technical difference between any recent snapshot of the WHATWG specification and anything that the WebApps Working Group publishes in the upcoming weeks.

2) The URL spec (from either source; per above, it doesn’t matter) is as backwards compatible with rfc3986 + rfc3987 as HTML5 is with HTML4, which is to say that it is not.  There are things specified by the prior versions of the specs that were never implemented, are broken, or don’t reflect current reality as implemented by contemporary web browsers.

3) Some (Roy Fielding in particular) would prefer a more layered approach where an error-correcting parsing specification is layered over a data format, much in the way that HTML5 is layered over DOM4.

Analysis of points 1, 2, 3 above.

1) What this means is that any choice between WHATWG and W3C specs is non-technical.  Furthermore, any choice to wait until either of those reaches an arbitrary maturity level is also non-technical.  It doesn’t make any sense to bring any of these discussions back to the HTML WG as these decisions will ultimately be made by W3C Management based on input from the AC.

2) In any case where the URL spec (either one, it matters not) differs from the relevant RFCs, from an HTML point of view the URL specification is the correct one.  This may mean that tools other than browsers parse URIs differently than web browsers do.  While clearly unfortunate, this discrepancy will likely take years, and possibly a decade or more, to resolve.

3) If somebody were willing to do the work that Roy proposes, it could be evaluated; but to date there are quite a few parties that have good ideas in this space but haven’t delivered on them.

Background data:

RFC 3986 provides for the ability to register new URI schemes; the WHATWG/W3C URL specification does not.  URIs that depend on schemes not defined by the URL specification would therefore not be compatible.  Anne has indicated a willingness to incorporate specifications that others may develop for additional schemes; however, he has also indicated that his personal interest lies in documenting what web browsers support.

Meanwhile, this is a concrete counterexample to the notion of the URL specification being a strict superset of rfc3986 + rfc3987.  Producers of URLs that want to be conservative in what they send (in the Postel sense) would be best served to restrict themselves to the as yet undefined intersection between these sets of specifications.
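As a rough illustration of what being conservative might look like for a producer, here is a minimal Python sketch.  Since the intersection is not defined anywhere, everything in it is an assumption made for illustration: the function name is_conservative_url is hypothetical, the scheme list is limited to http and https as a guess at what both families of specs handle alike, and the character check simply admits the characters RFC 3986 permits in a URI.  It is not a validator for either specification.

```python
import re
from urllib.parse import urlsplit

# Characters RFC 3986 permits in a URI: unreserved, gen-delims,
# sub-delims, and "%" (for percent-encoding).
_RFC3986_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")

# A "%" that is not followed by two hex digits is a malformed escape.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

# Assumption: schemes that both sets of specs are taken to treat alike.
CONSERVATIVE_SCHEMES = {"http", "https"}


def is_conservative_url(url: str) -> bool:
    """Hypothetical check: True if url stays inside a cautious subset."""
    # Rejects spaces, backslashes, control characters, and raw non-ASCII.
    if not _RFC3986_CHARS.fullmatch(url):
        return False
    if _BAD_PERCENT.search(url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises on some malformed bracketed hosts.
        return False
    # urlsplit lowercases the scheme, so "HTTP" compares equal to "http".
    return parts.scheme in CONSERVATIVE_SCHEMES and bool(parts.netloc)
```

A producer could run such a check before emitting a URL and percent-encode or reject anything that fails.  A URL with a backslash or a literal space, which browsers tend to repair silently, would be refused here rather than left to each consumer to interpret.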

Recommendations:

While I am optimistic that at some point in the future the W3C will feel comfortable referencing stable and consensus-driven specifications produced by the WHATWG, it is likely that some changes will be required to one or both organizations for this to occur.  Meanwhile, I encourage the W3C to continue on the path of standardizing a snapshot version of the WHATWG URL specification, and for HTML5 to reference the W3C version of the specification.

Furthermore, there has been talk of holding HTML5 until the W3C URL specification reaches Candidate Recommendation status.  I see no basis in the requirements for Normative References for this.  HTML5’s dependence on the URL specification is weak, and an analysis of the open bugs has determined that those changes would not affect HTML5.  Furthermore, the value of a “CR” phase for a document which is meant to capture and catch up to implementations is questionable.  Finally, waiting any small number of months won’t address the gap between URLs as implemented by web browsers and URIs as specified and used by formats such as RDF.

Should a more suitable (for example, architecturally layered) specification become available in the HTML 5.1 time-frame, the HTML WG should evaluate its suitability.
