It’s just data

The URL Mess

tl;dr: shipping is a feature; getting the URL feature well-defined should not block HTML5 given the nature of the HTML5 reference to the URL spec.

This is a subject desperately in need of an elevator pitch.  From my perspective, here are the top three things that need to be understood:

1) From an HTML5 specification point of view, there is no technical difference between any recent snapshot of the WHATWG specification and anything that the WebApps Working Group publishes in the upcoming weeks.

2) The URL spec (from either source; per the above, it doesn’t matter) is about as backwards compatible with RFC 3986 + RFC 3987 as HTML5 is with HTML4; which is to say that it is not.  There are things specified by the prior versions of those specs that were never implemented, are broken, or don’t reflect the current reality as implemented by contemporary web browsers.

3) Some (Roy Fielding in particular) would prefer a more layered approach, where an error-correcting parsing specification is layered over a data format, much in the way that HTML5 is layered over DOM4.

Analysis of points 1, 2, 3 above.

1) What this means is that any choice between WHATWG and W3C specs is non-technical.  Furthermore, any choice to wait until either of those reaches an arbitrary maturity level is also non-technical.  It doesn’t make any sense to bring any of these discussions back to the HTML WG as these decisions will ultimately be made by W3C Management based on input from the AC.

2) In any case where the URL spec (either one, it matters not) differs from the relevant RFCs, from an HTML point of view the URL specification is the correct one.  This means that tools other than browsers may parse URIs differently than web browsers do (a concrete illustration follows after this list).  While clearly unfortunate, this likely will take years, and possibly a decade or more, to resolve.

3) If somebody were willing to do the work that Roy proposes, it could be evaluated; but to date there are quite a few parties that have good ideas in this space but haven’t delivered on them.
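To make point 2 concrete, here is a small illustration of my own (not taken from either spec) of how an RFC 3986 style parser and a browser can disagree about the very same string; backslash handling is one of the better-known divergences:

```python
from urllib.parse import urlsplit  # an RFC 3986 style parser

# The input below is the literal string  http:\\example.com\foo
# Browsers, following the WHATWG URL specification, treat "\" as "/" for
# http(s) URLs, so they see host "example.com" and path "/foo" and will
# happily fetch http://example.com/foo.
# An RFC 3986 parser sees no authority component at all, just a path.
print(urlsplit("http:\\\\example.com\\foo"))
# SplitResult(scheme='http', netloc='', path='\\\\example.com\\foo',
#             query='', fragment='')
```

Which result is “correct” depends entirely on which document you treat as authoritative, which is exactly the gap described above.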

Background data:

RFC 3986 provides for the ability to register new URI schemes; the WHATWG/W3C URL specification does not.  URIs that depend on schemes not defined by the URL specification would therefore not be compatible.  Anne has indicated a willingness to incorporate specifications that others may develop for additional schemes; however, he has also indicated that his personal interest lies in documenting what web browsers support.

Meanwhile, this is a concrete counterexample to the notion of the URL specification being a strict superset of RFC 3986 + RFC 3987.  Producers of URLs that want to be conservative in what they send (in the Postel sense) would be best served to restrict themselves to the as-yet-undefined intersection between these sets of specifications.
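As a purely hypothetical illustration of what such a conservative intersection might look like in practice (the function and the lists below are mine; neither specification defines them), a producer could restrict itself to schemes the URL specification gives explicit rules for and to characters that RFC 3986 permits unencoded:

```python
from urllib.parse import urlsplit

# Hypothetical, illustrative check only.
# Schemes the WHATWG URL specification treats as "special".
SPECIAL_SCHEMES = {"http", "https", "ws", "wss", "ftp", "file"}
# A few characters browsers quietly tolerate or rewrite but RFC 3986 forbids raw.
RISKY_CHARS = set('\\ "<>^`{|}')

def is_conservative(url: str) -> bool:
    """Accept only URLs likely to parse the same way everywhere."""
    parts = urlsplit(url)
    return parts.scheme in SPECIAL_SCHEMES and not (RISKY_CHARS & set(url))

print(is_conservative("https://example.com/a?b=c"))  # True
print(is_conservative("http:\\\\example.com\\foo"))  # False
```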

Recommendations:

While I am optimistic that at some point in the future the W3C will feel comfortable referencing stable, consensus-driven specifications produced by the WHATWG, it is likely that some changes will be required in one or both organizations for this to occur.  Meanwhile, I encourage the W3C to continue on the path of standardizing a snapshot version of the WHATWG URL specification, and for HTML5 to reference the W3C version of that specification.

Furthermore, there has been talk of holding HTML5 until the W3C URL specification reaches Candidate Recommendation status.  I see no basis in the requirements for Normative References for this.  HTML5’s dependence on the URL specification is weak; an analysis of the open bugs has been made, and the determination was that none of those changes would affect HTML5.  Furthermore, the value of a “CR” phase for a document which is meant to capture and catch up to implementations is questionable.  Finally, waiting any small number of months won’t address the gap between URLs as implemented by web browsers and URIs as specified and used by formats such as RDF.

Should a more suitable (example: architecturally layered) specification become available in the HTML 5.1 time-frame, the HTML WG should evaluate its suitability.

References:


Related: Working Group Decision on ISSUE-56 urls-webarch

Posted by Sam Ruby at

Scheme registration is an open issue in the URL Standard. It seems worthwhile to have a list of schemes along with their rules. How exactly that should be done is unclear. It seems pretty clear at this point that IETF/IANA has a pretty poor track record when it comes to registries.

The problem is that the URL Standard parser special-cases a number of URL schemes due to deployed content that presumably resulted from poor initial implementations (and poor testing practices at the IETF). That special casing is what is annoying, but I do not really see a way to get rid of it. And we might need yet more special casing. What is important to move URLs forward is implementations trying to align with the specification. They are currently even further in the weeds and different from each other. Only once implementations start trying to align with the specification will we uncover what the document actually has to say.
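To give a rough idea of the kind of special casing in question (an illustrative sketch, not the URL Standard’s actual state machine, which also covers hosts, IDNA, percent-encoding, and much more), the “special” schemes get their own default ports and backslash treatment:

```python
# Illustrative sketch only; the URL Standard specifies this behavior via a
# parser state machine, not a lookup table.
SPECIAL = {  # scheme -> default port (file has none)
    "ftp": 21, "file": None, "http": 80,
    "https": 443, "ws": 80, "wss": 443,
}

def normalize_special(url: str) -> str:
    scheme, _, rest = url.partition(":")
    scheme = scheme.lower()
    if scheme in SPECIAL:
        # For special schemes, browsers treat "\" as "/" ...
        rest = rest.replace("\\", "/")
        # ... and drop an explicitly spelled-out default port.
        default = SPECIAL[scheme]
        if default is not None:
            rest = rest.replace(f":{default}/", "/", 1)
    return f"{scheme}:{rest}"

print(normalize_special("HTTP:\\\\example.com:80\\foo"))  # http://example.com/foo
```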

Posted by Anne van Kesteren at

Also, while I’m here, TLS?

Posted by Anne van Kesteren at

Also, while I’m here, TLS?

I’ve seen your series of posts on the subject, starting with TLS: first steps; is there a condensed how-to for DreamHost customers?

Posted by Sam Ruby at

There’s no point in holding HTML5 back for this, it’s not like this is HTML5’s major flaw. So I agree with your summary. I can’t really argue HTML5 made the wrong decision, given the scope and process within which it was being developed.

There’s no point in playing “blame” for the situation; it’s not really anyone’s responsibility to avoid conflicting specs from different organizations. It should be, but first people have to agree there’s a problem, and agree to work together to find a mutually acceptable solution.  Waiting for “the IETF” to fix something is a non-starter; it is, like the W3C, a volunteer organization.  But the IETF URL specs say more--about comparison and presentation and finding URLs in plain text and bidi--than the WHATWG URL spec seems to touch.

As with encodings, this is a case where some prioritize “compatibility with legacy web content” higher than “compatibility with non-web Internet applications” — is that really in the interest of end users? Or is it a browser-war viewpoint? 

I’d say get on with HTML5, but acknowledge the issue in the specs so people don’t have to dig through blog posts to find out the real situation; try to leave the politics behind.

Posted by Larry Masinter at

I can recommend Option 4, followed by turning the “Web Hosting” column in “Manage Domains” into redirects and adding the HSTS header. If you have any questions feel free to ask on IRC.

Posted by Anne van Kesteren at

I think for end users compatibility with content trumps compatibility with applications. The latter fades much more quickly than the former.

As for URLs, where does the IETF detail finding URLs in plain text? And as for bidirectional text, perhaps we should cover that. UI considerations are typically out-of-scope, but as these are non-trivial perhaps we could at least give some advice.

(Sam, your software keeps telling me I’m new here, in duplicate even.)

Posted by Anne van Kesteren at

URL test results

Posted by Sam Ruby at

URL Status

Plan B

W3C AC Poll results (member only)

Posted by Sam Ruby at

Why test Opera 12, but not Safari?

It seems about right that host parsing and file URL parsing are not particularly interoperable. The former is full of subtle security issues not picked up by all browsers yet and the latter is full of disagreement and inertia. (Insofar as you can claim URLs themselves are not full of inertia.)

Posted by Anne van Kesteren at

I am curious by the way which test suite you are using to claim that the technical stability of Plan B is superior. It seems you left that out of your email.

Posted by Anne van Kesteren at

Why test Opera 12, but not Safari?

We’d love to include Safari 8.  We need somebody with access to Safari 8 who can manage to copy/paste the results produced.  PLH has Safari 8 installed but ran into problems with capturing the results.  I do have access to a MacBook Pro, but it is running Safari 7.

claim that the technical stability of Plan B is superior

I feel it is important to preface any statements by repeating that this option comes into play only if all other options fail.  And that I’m not aware of anybody who feels that we are at that point yet.

But to answer your question: the root problem is that the URL standard (any version, be it a snapshot or the current master, it makes no difference) makes statements that are known to not currently be true and may never be true.  One possible solution would be to make fewer normative statements.  That could be accomplished by removing material, or by marking some of the material as informative.

The idea is that the remaining parts are the ones that are thought to be stable, and that can be proven to be so via test cases.  And there you have the answer to your question.
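A rough sketch of what “proven via test cases” could look like in practice (purely illustrative; the test names and pass/fail values below are made up, not real results):

```python
# Hypothetical example: keep a statement normative only if every
# implementation already agrees on the test cases that exercise it.
def stable_cases(results: dict[str, dict[str, bool]]) -> set[str]:
    """results maps test-case id -> {implementation name: passed?}."""
    return {case for case, by_impl in results.items()
            if by_impl and all(by_impl.values())}

results = {  # made-up data for illustration
    "basic-http-parse":  {"gecko": True,  "webkit": True,  "chromium": True},
    "file-drive-letter": {"gecko": False, "webkit": True,  "chromium": False},
}
print(stable_cases(results))  # {'basic-http-parse'}
```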

- - -

But there is more to be said.  All politics aside, the right place for this work to be done is not in a copy, but in the upstream source.  But to address that discussion, it makes sense to back up and look at the bigger picture.

In my opinion, we need to start by looking at the part that is often left out of the “Living Standard” discussions.  It is not a one-dimensional choice between up-to-date and stale.  The problem space is actually multi-dimensional.  Proven vs. experimental is another dimension.

Furthermore, the reality is that the web is so huge that inertia gives stable specs an effectively indefinite shelf life.  Chances for a mulligan are few and far between.  One of my personal favorite examples of this is mentioned in the URL standard.

The conclusion I reach is that defining standards via accretion is a valid model.  Define what is known, and describe what is unknown.  Validate the results through testing, and advertise a published snapshot when the test results are good enough.  And then either in parallel or serially repeat this process and push back the boundary once again.

Raising this up a level, is there anyone out there who is served by describing how to handle file: URLs using the same level of normative language as the description of the URLUtils interface?

I’m willing to help in any way I can.  As you are well aware, I am not allergic to liberal licenses, so that’s not an obstacle.

Posted by Sam Ruby at

urispec mailing list - CG - wiki

Posted by Sam Ruby at

The idea behind having normative statements for file URLs rather than an open issue is that somebody took the time to evaluate the playing field and came up with a recommendation for how to best get convergence. If that was not done it would still be an open issue. I could see having an in-between state while implementations try to match this recommendation, since there are often hiccups along the way, but given the complexity of the field that is hard to do sanely.

Posted by Anne van Kesteren at

Anne, if I read you correctly, we agree that there is value in having a version of the document that limits itself to stable and proven normative statements, and that doing so would be hard.  If so, that’s progress!

The next question is: if I were to be able to find a willing victim (or be stupid enough to volunteer myself), do you have any recommendations on how this work would be structured?  Knowing the answer to that question will affect my ability to find a volunteer.

Posted by Sam Ruby at

What do you recommend doing for the parts that are unstable?

I know Gecko needs to rewrite its URL parser. I suspect IE might have to as well in due course. WebKit and Chromium have fairly reasonable parsers that might need some tweaking for certain cases, but probably no rewrite. wget/curl are mostly web-compatible but I suspect they want to align more. Once they have found the time to invest in convergence on URL parsing, what statements can they follow? I personally think that is a more important goal than identifying what matches between implementations today, as convergence is what standards are all about.

Posted by Anne van Kesteren at

What do you recommend doing for the parts that are unstable?

Identifying those parts as unstable, and adding text telling content producers what to avoid.

The current definition for file: URLs produces a number of test failures across all current implementations, and doesn’t match the definitions the URL standard intends to obsolete.
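As a concrete illustration of why file: URLs are so contentious, the Windows drive-letter forms alone are handled quite differently by an RFC-style parser and by browsers (the comment about browser behavior reflects my reading of the URL standard, not test data):

```python
from urllib.parse import urlsplit

# An RFC 3986 style parser sees "c:/autoexec.bat" as a bare path with no host.
print(urlsplit("file:c:/autoexec.bat"))
# SplitResult(scheme='file', netloc='', path='c:/autoexec.bat',
#             query='', fragment='')

# The URL standard, as I read it, instead normalizes drive-letter forms
# toward file:///c:/autoexec.bat, while legacy implementations disagree on
# exactly how (and on variants like file://c:/... and file:\c:\...), which
# is where many of the test failures come from.
```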

I can’t speak for Microsoft, but given the data I see, if I were in their position I would be skeptical.  If/when they were inclined to comment on these rules, it seems plausible that they would request changes.

Posted by Sam Ruby at

Yeah that seems fair. More research into file URLs would be welcome too. When I have time again I might revamp the way the parser is written down so file URLs have their own “code path”, effectively. That should also make it easier to add such notes.

Posted by Anne van Kesteren at

The system administration staff at the ASF took this down while debugging an unrelated problem.

Posted by Aaron at

