It’s just data


I’ve been looking into differences between the WHATWG URL Living Standard and the combination of RFC 3986 and RFC 3987, and I’ve come up with an indirect but effective way to identify them.  To start, I downloaded urltestdata.txt and urltestparser.  I then wrote a small script to convert the test data into JSON.

I then wrote another script to take this data and pass it through what is advertised as a closely conforming implementation of the relevant RFCs.
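
As a rough illustration of this kind of comparison (not the script used here), the following Python sketch splits a URL with the regular expression given in RFC 3986, Appendix B, and reports the fields that differ from an expected record.  The record layout, the field names, and the function names are all assumptions made for the example.

```python
import re

# Regular expression from RFC 3986, Appendix B, for splitting a URI
# reference into its five generic components.
RFC3986_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_rfc3986(url):
    """Split a URL into RFC 3986 components (hypothetical helper).

    Missing components are reported as empty strings, which loses the
    distinction between "absent" and "present but empty".
    """
    m = RFC3986_SPLIT.match(url)
    return {
        "scheme": m.group(2) or "",
        "authority": m.group(4) or "",
        "path": m.group(5) or "",
        "query": m.group(7) or "",
        "fragment": m.group(9) or "",
    }


def differences(url, expected):
    """Return {field: (expected, actual)} for every field that disagrees.

    `expected` is a hypothetical test record: a dict mapping component
    names to the values the test data says the parser should produce.
    """
    actual = split_rfc3986(url)
    return {
        field: (value, actual[field])
        for field, value in expected.items()
        if field in actual and value != actual[field]
    }
```

Calling `differences(url, record)` for each record in the converted test data gives a list of mismatches to work through, one field at a time.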

Looking at the results, the first set of differences related to the stripping of leading and trailing whitespace, so I updated the script to do that stripping itself, which let me focus on the remaining differences.  Similarly, the URL parsing definition includes the leading ? and # in the query and fragment values respectively, so I eliminated those differences in the cases where the values were non-empty.
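
The two adjustments just described can be sketched in Python as follows.  This is a minimal sketch rather than the actual script: the `query` and `fragment` field names and both function names are hypothetical, and `str.strip()` is only an approximation of whichever characters the specification actually strips.

```python
def adjust_input(url):
    """Strip leading and trailing whitespace from a test input.

    Assumption: str.strip() stands in for the exact set of characters
    that the URL parsing definition removes.
    """
    return url.strip()


def adjust_expected(expected):
    """Drop the leading '?' and '#' from non-empty query and fragment values.

    `expected` is a hypothetical test record (a dict); a copy is returned
    so that the original record is left unchanged.  Empty values are kept
    as they are.
    """
    adjusted = dict(expected)
    query = adjusted.get("query", "")
    if query.startswith("?"):
        adjusted["query"] = query[1:]
    fragment = adjusted.get("fragment", "")
    if fragment.startswith("#"):
        adjusted["fragment"] = fragment[1:]
    return adjusted
```

With both sides adjusted this way, any mismatches that remain reflect something other than whitespace handling or the leading delimiters.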

The resulting script produces this output.

The next set of differences concerns canonicalization, so I ran the tests using Addressable’s normalize method.  Note that this normalization is non-standard.  Updated output including normalization.