utf-8x2

2006-05-04T18:06:48Z

Gordon Weakliem: Response is a hash of the username, realm and password, joined by colons, along with hashes of other request parameters. BUT the first hash is apparently generated using Encoding.Default (e.g. Windows-1252 for my machine), not UTF-8 as one might think. I say “apparently” because looking in the disassembly, it looks like they just copy bytes from the string into a buffer to be hashed, instead of using Encoding.GetBytes(string). Whatever encoding that is, it’s not UTF-8, at least not on my box. So all the hard work of interpreting the charset parameter is for nothing - the server still has to guess at a charset to use to calculate the digest.
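To make the mismatch concrete, here is a minimal sketch of that first hash computed under two encodings. It assumes the MD5 that HTTP Digest authentication uses by default (the quote does not name the algorithm); the `ha1` helper and the credentials are hypothetical, chosen only so that one non-ASCII character is in play.

```python
import hashlib

def ha1(username, realm, password, encoding):
    """Hash "username:realm:password" after encoding it with the given charset.

    MD5 is an assumption here (the HTTP Digest default); the helper name is illustrative.
    """
    joined = ":".join([username, realm, password])
    return hashlib.md5(joined.encode(encoding)).hexdigest()

# Hypothetical credentials; the "é" is what exposes the mismatch.
user, realm, password = "josé", "example", "secret"

as_utf8 = ha1(user, realm, password, "utf-8")
as_cp1252 = ha1(user, realm, password, "cp1252")

# Same credentials, different bytes, different digests.
print(as_utf8 == as_cp1252)  # False
```

With pure ASCII credentials the two encodings produce identical bytes, which is presumably why this sort of thing goes unnoticed for so long.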

It gets worse.

First, suppose somebody notices this and decides to encode their data as UTF-8 before making this call. Now everything is working to the spec.

Now suppose a fix is made that interprets the stream of bytes using Encoding.Default and converts that interpretation into UTF-8. The data that was already UTF-8 gets encoded a second time. Don’t say it can’t happen. This problem is fairly common.
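A small sketch of what that double conversion does to a single character, assuming Windows-1252 stands in for Encoding.Default (as on Gordon’s machine) and using the ellipsis as the sample text:

```python
original = "\u2026"  # HORIZONTAL ELLIPSIS

# Step 1: the careful caller encodes to UTF-8, as the spec asks.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)    # b'\xe2\x80\xa6'

# Step 2: the "fix" reads those bytes as Windows-1252 (assumed default encoding)...
misread = utf8_bytes.decode("cp1252")
print(misread)       # â€¦

# Step 3: ...and converts that interpretation into UTF-8 again.
double_encoded = misread.encode("utf-8")
print(double_encoded)  # b'\xc3\xa2\xe2\x82\xac\xc2\xa6'
```

One character has become three, and three bytes have become seven.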

Now suppose somebody, in another context, decides to play it safe and convert such data into 7-bit-safe ASCII character references. And get fancy and not just use numeric character references, but use named character references when they can. Oh, and neglect to include any DTD that defines these character references, because this is RSS 0.92, and being well-formed is so RSS 0.91.
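Here is a rough sketch of that last step, applied to an ellipsis that has already been through the double conversion. The `to_ascii_entities` helper is hypothetical, and it leans on Python’s HTML 4 entity table as a stand-in for whatever table the real tool used.

```python
from html.entities import codepoint2name

def to_ascii_entities(text):
    """Replace every non-ASCII character with a character reference,
    preferring a named reference and falling back to a numeric one.
    ASCII characters, markup included, pass through untouched in this sketch."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)

# An ellipsis encoded as UTF-8 and then misread as Windows-1252 (assumed).
mangled = "\u2026".encode("utf-8").decode("cp1252")

print(to_ascii_entities(mangled))  # &acirc;&euro;&brvbar;
```

The output is pure ASCII, but without a DTD declaring `acirc`, `euro` and `brvbar`, an XML parser has no idea what those references mean.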

Farfetched? Look at this page. See the following?

function(){…}()

Now, look at the feed. Doesn’t look like “dot, dot, dot”, does it?

Given this situation, what’s a fellow to do?

Why, submit a patch and a half dozen test cases to the Universal Feed Parser, of course. Because when all is said and done, users will blame the victim.

Which, in this case, may very well be somebody on Gordon’s team.