It’s just data

Copy and Paste

Cory Doctorow: The theme for this year's ETech is "Remix," encompassing those nexus points of iterative hacking and large ideas that have a way of transforming technology

Cool.

The Problem

Now let's look at the next line:

* The phone has become a platform, moving beyond mere voice to smart mobile sensorâ€”and back to phone again, by way of voice-over-IP.

sensorâ€”and?  How did that happen?  Let's look at the O'Reilly source from which Cory copied and pasted:

The phone has become a platform, moving beyond mere voice to smart mobile sensor—and back to phone again, by way of voice-over-IP.

Much better.  But let's view source:

 <li>The phone has become a platform, moving beyond mere voice to smart mobile 
     sensor&#8212;and back to phone again, by way of voice-over-IP.</li>

Here's the first piece of the puzzle.  O'Reilly's weblog uses a numeric character reference for an em dash.  It displays fine.
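That reference resolves to exactly the character intended; a quick check in Python (my illustration, not part of the original pages):

```python
import html

# &#8212; is the decimal numeric character reference for U+2014 EM DASH
resolved = html.unescape("sensor&#8212;and")
assert resolved == "sensor\u2014and"
print(resolved)  # the list item, with a real em dash
```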

Now let's view source on the Boing Boing page.  What we see there is sensorâ€”and.  Something clearly happened in the transfer.  Let's look closer at the three bytes, this time in hex: E2 80 94.  This turns out to be the UTF-8 representation of U+2014, known as an "em dash", which is the correct character.

However, my fully standards-compliant browser displays this clean UTF-8 as line noise.  What's going on?

The Cause

Further investigation reveals that the browser is displaying this as if it were encoded using windows-1252.  How did that happen?  The story continues.
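The mangling is easy to reproduce; here is a small Python sketch (mine, not from the original exchange) of what the browser effectively did:

```python
# The author wrote an em dash, U+2014, and the page was encoded as UTF-8.
utf8_bytes = "sensor\u2014and".encode("utf-8")
assert b"\xe2\x80\x94" in utf8_bytes  # the three bytes E2 80 94

# The browser, steered toward windows-1252, decodes those same bytes:
mangled = utf8_bytes.decode("windows-1252")
print(mangled)  # sensorâ€”and
```

Each of the three UTF-8 bytes becomes its own windows-1252 character, which is precisely the line noise seen above.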

Viewing source on Boing Boing once again, you will see an entirely futile attempt to declare the correct encoding:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

As we will see later, this is a hack that really shouldn't work, but does in many cases.  Just not this one.  Continuing on the trail, we take a look at the HTTP headers that are returned by Boing Boing:

 HTTP/1.1 200 OK
 Date: Wed, 22 Sep 2004 16:47:16 GMT
 Server: Apache/2.0.40 (Red Hat Linux)
 Accept-Ranges: bytes
 Vary: Accept-Encoding,User-Agent
 Content-Type: text/html; charset=ISO-8859-1

The last line is key.  As Cory once said, "There's more than one way to describe something."  OK, so Cory was talking about something else there, but we are dealing with metacrap nevertheless.  In this case, the charset of the page is described in two different places: one that is completely correct, and completely ignored; and one that is incorrect, and partially ignored.

Windows-1252 is the encoding that is nearly equivalent to HTTP's default encoding of ISO-8859-1, but it differs in twenty-seven places, two of which are evident here.
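All of those differences sit in the 0x80-0x9F range, which ISO-8859-1 reserves for control characters.  A Python sketch (my illustration) to enumerate them:

```python
# Count the byte values that windows-1252 and iso-8859-1 decode differently.
diffs = []
for b in range(256):
    raw = bytes([b])
    try:
        cp1252 = raw.decode("windows-1252")
    except UnicodeDecodeError:
        continue  # five positions are left undefined in windows-1252
    if cp1252 != raw.decode("iso-8859-1"):
        diffs.append(b)

print(len(diffs))                     # 27
print(0x80 in diffs, 0x94 in diffs)  # the two bytes evident here
```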

Precedence Rules

In this case, the data inside the document is correct, and the data which accompanies the document in the transfer is incorrect.  I have a theory that, in general, the accuracy of metadata is inversely proportional to the distance between the metadata and the data which it purports to describe.  Apparently, the authors of HTTP and HTML disagree with me, as the priorities are defined to be first the HTTP Content-Type header; then the meta element in the HTML head; and finally any charset attributes on elements in the HTML body.  Given that HTTP defines a default charset, you would think that the others would never come into play, but here a bit of reality intrudes.  Direct from the HTML specification itself:

The HTTP protocol ( [RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.

OK, so this section of the specification explains why the correct encoding is ignored, even though it is placed in something (meta http-equiv) clearly designed as a hack to address exactly this situation.
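The resulting precedence can be sketched as a small function; `pick_charset` here is my hypothetical illustration, not any browser's actual code, modeling the HTTP header winning over the meta element and a bare text/html contributing no default:

```python
def pick_charset(http_content_type, meta_content_type=None):
    """Return the first charset parameter found, HTTP header before meta.

    Per the spec passage quoted above, a bare "text/html" contributes
    no default charset, so the search falls through to the meta element.
    """
    for source in (http_content_type, meta_content_type):
        if not source:
            continue
        for param in source.split(";")[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset":
                return value.strip().strip('"').lower()
    return None

# Boing Boing's case: the (wrong) HTTP header wins over the (right) meta.
print(pick_charset("text/html; charset=ISO-8859-1",
                   "text/html; charset=utf-8"))   # iso-8859-1

# Had the header been a bare text/html, the meta element would have won.
print(pick_charset("text/html", "text/html; charset=utf-8"))  # utf-8
```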

We still haven't fully explained the line noise.  Even if the "wrong" piece of metadata was picked, why was windows-1252 selected?  This has been a point of contention for a number of years, and has inspired the creation of tools such as the Demoroniser, which has this to say on the subject:

A little detective work revealed that, as is usually the case when you encounter something shoddy in the vicinity of a computer, Microsoft incompetence and gratuitous incompatibility were to blame. Western language HTML documents are written in the ISO 8859-1 Latin-1 character set, with a specified set of escapes for special characters. Blithely ignoring this prescription, as usual, Microsoft use their own "extension" to Latin-1, in which a variety of characters which do not appear in Latin-1 are inserted in the range 0x82 through 0x95--this having the merit of being incompatible with both Latin-1 and Unicode, which reserve this region for additional control characters.

If you stroll through the Mozilla bug database, you can chronicle the transition through the stages of grief: from denial to anger to bargaining to depression, and ultimately acceptance.

Recapping: we have a page which is correctly encoded as UTF-8.  Mozilla ignores this, as well as the declaration inside the document that this is so.  Instead it chooses to respect the HTTP header, which it finds to be incorrect, and compensates by substituting a Windows-specific encoding.

Solution

Now that we have identified the header that is in error, the question that remains is: how should this be fixed?

Conclusion

The web as we know it is built upon a foundation of concepts such as characters, HTML, and HTTP.  These concepts are still evolving, not always mutually consistent, and incomplete.  Sometimes, in order to solve problems such as these, you need to know not only what the standards say, but which parts there is general agreement on, and which parts are pretty consistently ignored.

This has a potential to make recombinant data services and syndicated e-commerce a pretty messy matter indeed.  Particularly if internationalization is a requirement.

I now know what I am going to be submitting as a proposal.


Interesting.  I ssh'd into my own box and put a default charset in there just to make things cleaner ... maybe.  ;)

Posted by Geof at

Sam Ruby: Copy and Paste

Cory Doctorow tries (and fails) to Escape The Madness. Sam Ruby documents the inevitable result....

Excerpt from del.icio.us/ffg at

Sam,

Is there a good web resource or book where one can read up on how to figure out what exactly is going on when a site is experiencing character set problems like this?

Posted by Scott Johnson at

Sam Ruby: Copy and Paste

Simon Willison : Sam Ruby: Copy and Paste - This character encoding glitch has bitten me more times than I care to say....

Excerpt from HotLinks - Level 1 at

Scott: not that I'm aware of.  The primary reason why I know about this stuff is because I wrote my own weblogging software.  Think about it: my page (talking about weird characters) faithfully displays these same characters on a range of browsers.  And this entry works in my various feeds too - across a range of aggregators.

Getting that to work properly is much harder than it ought to be.

Posted by Sam Ruby at

Sam,

Perhaps you should write a book.  ;-)

I can easily see how getting all of this stuff to work on your site would be harder than it ought to be.  Especially without good, simple documentation on techniques to accomplish such a feat.

Posted by Scott Johnson at

I made a PHP function to get some XML feeds and print them in a web page, but I was getting bad characters in the page.  I didn't have much knowledge about character sets and all that stuff, but I spent the whole night reading about it.  I realized that I needed to use UTF-8 to properly display the content, so I used utf8_decode for the job.

Well, it's been a hell of a night.  I just didn't know how tricky this part of HTML could be.

Regards

Posted by Venezolano at

Sam Ruby: Copy and Paste

Sam Ruby: Copy and Paste, via Simon. I haven't had this problem because I do everything right (all Unicode, all the time). But I've seen other sites where this happens a lot, particularly Roger Simon's site....

Excerpt from Keith's Weblog at

Sam,
Just keep writing the blog and the book will write itself ;-)
Thanks for grappling with all this stuff and putting it online in such clear terms. One of these days I'm going to get my head around it, I just know I am.

Posted by Phil Wainewright at

Shouldn't Apache have noticed the http-equiv and returned the correct encoding? IIRC, http-equiv was created so that web servers would not require such global configuration changes, and would indicate the correct encoding per-resource.

Posted by Ziv Caspi at

Your efforts to describe and document problems with character encoding have been incredibly useful for me. Following your advice, I switched to using UTF-8 on all of my websites, and it has certainly helped. Despite that, I still get bitten by the odd character which causes my pages to barf when served as application/xhtml+xml.

Posted by Simon Jessey at

"How should this be fixed"--don't they simply need to put quote-marks on both sides of utf-8 to fix this?

Posted by Adam Rice at

Adam: where?  On the meta element?  No, this element is correct as coded.  Check out google.com to see an instance where this works.  Of course, there it isn't used to affect the rendering of the page; it is used in the hope that it will influence the encoding used by the browser when submitting the data from the form.

Yet another case where standards are incomplete, inconsistent, and partially respected and partially ignored.

Posted by Sam Ruby at

Sam Ruby: Copy and Paste

A nice and thorough explanation of meta tags with charset declarations, of the HTTP Content-Type header with its charset declaration, and of what browsers make of them.  I keep saying it: the web is a technical garbage dump that just happens to...

Excerpt from Hugos house of programming error at


A description of problems caused by character encoding mismatches....

Excerpt from 456 Berea Street at

msdn2.microsoft.com

doesnâ€™t (at the bottom of the page)

Posted by MK at

MK: Thanks!

It reads doesnâ€™t.  Even in IE.

Posted by Sam Ruby at

I'd forgotten this little twist on charset until Gavin reminded me: probably BoingBoing shouldn't have removed the charset from the Content-type header, the solution they seem to have chosen, because they are sending gzipped content. Now they are at risk of having Moz get part of the content, unzip it, discover that it should be in a charset other than the one it expected from HTTP's default, and get into an argument with Apache over what the proper offset is for a re-request.

Posted by Phil Ringnalda at

While struggling to get a translated major web site running, I found this tutorial to be very, very helpful.

[link]

Dealing with any J2EE implementations and character encoding is an exercise in patience. For anyone struggling with that, I found that (logically when you think about it) any dynamic JSP includes must have the content-type set in the included JSP as they are compiled and executed independently from the JSP calling them. At least in JSP 1.1, which is what we were using.

As far as browsers go, just remember - the HTTP headers trump all. If you've set those, it doesn't matter what else you do. If you haven't set those, then you can concern yourself with the meta-tags.

Personally, I suggest going the HTTP header route. It's at least consistent and explicit.

Posted by Andrew Robinson at

Building character

I've been working on mod_blog for the past few days. Bob, who runs the Friday D&D game, has a site that used to use Blogger but due to reliability problems (as well as certain security issues related to FTP) I switched them over to mod_blog....

Excerpt from The Boston Diaries at

Cultural Sensitivity in Technology

The ‘Coltrane’ release of Lotus Freelance Graphics, the 1998 version bundled with Lotus SmartSuite, was famously held up for a month when it was found out that one of the clip art images had a tiny 20-pixel image of Taiwanese currency (rather than...

Excerpt from Koranteng's Toli at

can i get copy and paste work
if i then mail me at raiseinheaven@rediffmail.com

Posted by yuvraj at

Unicode and Ruby

I gave a presentation called I18n, M17n, Unicode, And All That at the recent 2006 RubyConf in Denver. This piece doesn’t duplicate this presentation; it outlines the problem, some conference conversation, and includes a couple of images that you...

Excerpt from ongoing at

Answer by Lichtamberg for Change the charset attribute in the html header

Maybe you should try to change your webserver's encodings? E.g. adding another charset? See [link] or maybe this: [link]...

Excerpt from Change the charset attribute in the html header - Stack Overflow at

Core design principles

I intend Foliomatic primarily for personal web sites such as this one. It should also be useful for projects and organizations whose Web presence is mostly static content, updated from time to time. It is not going to be a … Continue reading →...

Excerpt from Owl's Portfolio at
