It’s just data

Copy and Paste

Cory Doctorow: The theme for this year's ETech is "Remix," encompassing those nexus points of iterative hacking and large ideas that have a way of transforming technology

Cool.

The Problem

Now let's look at the next line:

* The phone has become a platform, moving beyond mere voice to smart mobile sensorâ€”and back to phone again, by way of voice-over-IP.

sensorâ€”and?  How did that happen?  Let's look at the O'Reilly source from which Cory copied and pasted:

The phone has become a platform, moving beyond mere voice to smart mobile sensor—and back to phone again, by way of voice-over-IP.

Much better.  But let's view source:

 <li>The phone has become a platform, moving beyond mere voice to smart mobile 
     sensor&#8212;and back to phone again, by way of voice-over-IP.</li>

Here's the first piece of the puzzle.  O'Reilly's weblog uses a numeric character reference for an em dash.  It displays fine.
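That reference resolves to exactly the character intended; a quick check in Python (my illustration, not part of the original pages):

```python
import html

# &#8212; is the decimal numeric character reference for U+2014 EM DASH
resolved = html.unescape("sensor&#8212;and")
assert resolved == "sensor\u2014and"
print(resolved)  # the list item, with a real em dash
```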

Now let's view source on the Boing Boing page.  What we see there is sensorâ€”and.  Something clearly happened in the transfer.  Let's look closer at the three bytes, this time in hex: E2 80 94.  This turns out to be the UTF-8 representation of U+2014, known as an "em dash", which is the correct character.

However, my fully standards-compliant browser displays this clean UTF-8 as line noise.  What's going on?

The Cause

Further investigation reveals that the browser is displaying this as if it were encoded using windows-1252.  How did that happen?  The story continues.
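The mangling is easy to reproduce; here is a small Python sketch (mine, not from the original exchange) of what the browser effectively did:

```python
# The author wrote an em dash, U+2014, and the page was encoded as UTF-8.
utf8_bytes = "sensor\u2014and".encode("utf-8")
assert b"\xe2\x80\x94" in utf8_bytes  # the three bytes E2 80 94

# The browser, steered toward windows-1252, decodes those same bytes:
mangled = utf8_bytes.decode("windows-1252")
print(mangled)  # sensorâ€”and
```

Each of the three UTF-8 bytes becomes its own windows-1252 character, which is precisely the line noise seen above.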

Viewing source on Boing Boing once again, you will see an entirely futile attempt to declare the correct encoding:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

As we will see later, this is a hack that really shouldn't work, but does in many cases.  Just not this one.  Continuing on the trail, we take a look at the HTTP headers that are returned by Boing Boing:

 HTTP/1.1 200 OK
 Date: Wed, 22 Sep 2004 16:47:16 GMT
 Server: Apache/2.0.40 (Red Hat Linux)
 Accept-Ranges: bytes
 Vary: Accept-Encoding,User-Agent
 Content-Type: text/html; charset=ISO-8859-1

The last line is key.  As Cory once said, "There's more than one way to describe something."  OK, so Cory was talking about something else there, but we are dealing with metacrap nevertheless.  In this case, the charset of the page is described in two different places: one that is completely correct, and completely ignored; and one that is incorrect, and partially ignored.

Windows-1252 is the encoding that is nearly equivalent to HTTP's default encoding of ISO-8859-1, but it differs in twenty-seven places, two of which are evident here.
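All of those differences sit in the 0x80-0x9F range, which ISO-8859-1 reserves for control characters.  A Python sketch (my illustration) to enumerate them:

```python
# Count the byte values that windows-1252 and iso-8859-1 decode differently.
diffs = []
for b in range(256):
    raw = bytes([b])
    try:
        cp1252 = raw.decode("windows-1252")
    except UnicodeDecodeError:
        continue  # five positions are left undefined in windows-1252
    if cp1252 != raw.decode("iso-8859-1"):
        diffs.append(b)

print(len(diffs))                     # 27
print(0x80 in diffs, 0x94 in diffs)  # the two bytes evident here
```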

Precedence Rules

In this case, the data inside the document is correct, and the data which accompanies the document in the transfer is incorrect.  I have a theory that, in general, the accuracy of metadata is inversely proportional to the distance between the metadata and the data which it purports to describe.  Apparently, the authors of HTTP and HTML disagree with me, as the priorities are defined to be first the HTTP Content-Type header; then the meta element in the HTML head; and finally any charset attributes on elements in the HTML body.  Given that HTTP defines a default charset, you would think that the others would never come into play, but here a bit of reality intrudes.  Direct from the HTML specification itself:

The HTTP protocol ( [RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.

OK, so this section of the specification explains why the correct encoding is ignored, even though it is placed in something (meta http-equiv) clearly designed as a hack to address exactly this situation.
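The resulting precedence can be sketched as a small function; `pick_charset` here is my hypothetical illustration, not any browser's actual code, modeling the HTTP header winning over the meta element and a bare text/html contributing no default:

```python
def pick_charset(http_content_type, meta_content_type=None):
    """Return the first charset parameter found, HTTP header before meta.

    Per the spec passage quoted above, a bare "text/html" contributes
    no default charset, so the search falls through to the meta element.
    """
    for source in (http_content_type, meta_content_type):
        if not source:
            continue
        for param in source.split(";")[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset":
                return value.strip().strip('"').lower()
    return None

# Boing Boing's case: the (wrong) HTTP header wins over the (right) meta.
print(pick_charset("text/html; charset=ISO-8859-1",
                   "text/html; charset=utf-8"))   # iso-8859-1

# Had the header been a bare text/html, the meta element would have won.
print(pick_charset("text/html", "text/html; charset=utf-8"))  # utf-8
```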

We still haven't fully explained the line noise.  Even if the "wrong" piece of metadata was picked, why was windows-1252 selected?  This has been a point of contention for a number of years, and has inspired the creation of tools such as the Demoroniser, which has this to say on the subject:

A little detective work revealed that, as is usually the case when you encounter something shoddy in the vicinity of a computer, Microsoft incompetence and gratuitous incompatibility were to blame. Western language HTML documents are written in the ISO 8859-1 Latin-1 character set, with a specified set of escapes for special characters. Blithely ignoring this prescription, as usual, Microsoft use their own "extension" to Latin-1, in which a variety of characters which do not appear in Latin-1 are inserted in the range 0x82 through 0x95--this having the merit of being incompatible with both Latin-1 and Unicode, which reserve this region for additional control characters.

If you stroll through the Mozilla bug database, you can chronicle the transition through the stages of grief: from denial to anger to bargaining to depression, and ultimately acceptance.

Recapping: we have a page which is correctly encoded as UTF-8.  Mozilla ignores this, as well as the declaration inside the document that this is so.  Instead it chooses to respect the HTTP header, which it finds to be incorrect, and compensates by substituting a Windows-specific encoding.

Solution

Now that we have identified the header that is in error, the question that remains is: how should this be fixed?

Conclusion

The web as we know it is built upon a foundation of concepts such as characters, HTML, and HTTP.  These concepts are still evolving, not always mutually consistent, and incomplete.  Sometimes, in order to solve problems such as these, you need to know not only what the standards say, but which parts there is general agreement on, and which parts are pretty consistently ignored.

This has a potential to make recombinant data services and syndicated e-commerce a pretty messy matter indeed.  Particularly if internationalization is a requirement.

I now know what I am going to be submitting as a proposal.


Interesting.  I ssh'd into my own box and put a default charset in there just to make things cleaner ... maybe.  ;)

Posted by Geof at

Sam Ruby: Copy and Paste

Cory Doctorow tries (and fails) to Escape The Madness. Sam Ruby documents the inevitable result....

Excerpt from del.icio.us/ffg at

Sam,

Is there a good web resource or book where one can read up on how to figure out what exactly is going on when a site is experiencing character set problems like this?

Posted by Scott Johnson at

Sam Ruby: Copy and Paste

Simon Willison : Sam Ruby: Copy and Paste - This character encoding glitch has bitten me more times than I care to say....

Excerpt from HotLinks - Level 1 at

Scott: not that I'm aware of.  The primary reason why I know about this stuff is because I wrote my own weblogging software.  Think about it: my page (talking about weird characters) faithfully displays these same characters on a range of browsers.  And this entry works in my various feeds too - across a range of aggregators.

Getting that to work properly is much harder than it ought to be.

Posted by Sam Ruby at

Sam,

Perhaps you should write a book.  ;-)

I can easily see how getting all of this stuff to work on your site would be harder than it ought to be.  Especially without good, simple documentation on techniques to accomplish such a feat.

Posted by Scott Johnson at

I made a PHP function to get some XML feeds and print them in a web page, but I was getting bad characters in the page.  I didn't have much knowledge about character sets and all that stuff, but I spent the whole night reading about it.  I realized that I needed to use UTF-8 to properly display the content, so I used utf8_decode for the job.

Well, it's been a hell of a night.  I just didn't know how tricky this part of HTML could be.

Regards

Posted by Venezolano at

Sam Ruby: Copy and Paste

Sam Ruby: Copy and Paste, via Simon. I haven't had this problem because I do everything right (all Unicode, all the time). But I've seen other sites where this happens a lot, particularly Roger Simon's site....

Excerpt from Keith's Weblog at

Sam,
Just keep writing the blog and the book will write itself ;-)
Thanks for grappling with all this stuff and putting it online in such clear terms. One of these days I'm going to get my head around it, I just know I am.

Posted by Phil Wainewright at

Shouldn't Apache have noticed the http-equiv and returned the correct encoding? IIRC, http-equiv was created so that web servers would not require such global configuration changes, and would indicate the correct encoding per-resource.

Posted by Ziv Caspi at

Your efforts to describe and document problems with character encoding have been incredibly useful for me. Following your advice, I switched to using UTF-8 on all of my websites, and it has certainly helped. Despite that, I still get bitten by the odd character which causes my pages to barf when served as application/xhtml+xml.

Posted by Simon Jessey at

"How should this be fixed"--don't they simply need to put quote-marks on both sides of utf-8 to fix this?

Posted by Adam Rice at

Adam: where?  On the meta element?  No, this element is correct as coded.  Check out google.com to see an instance where this works.  Of course, there it isn't used to affect the rendering of the page; it is used in the hope that it will influence the encoding used by the browser when submitting the data from the form.

Yet another case where standards are incomplete, inconsistent, and partially respected and partially ignored.

Posted by Sam Ruby at

Sam Ruby: Copy and Paste

A nice and thorough explanation of meta tags with charset declarations, of the HTTP Content-Type header with its charset declaration, and of what browsers make of them.  I keep saying it: the web is a technical garbage dump that just happens to...

Excerpt from Hugos house of programming error at


A description of problems caused by character encoding mismatches....

Excerpt from 456 Berea Street at

msdn2.microsoft.com

doesnâ€™t (at the bottom of the page)

Posted by MK at

MK: Thanks!

It reads doesnâ€™t.  Even in IE.

Posted by Sam Ruby at

I'd forgotten this little twist on charset until Gavin reminded me: probably BoingBoing shouldn't have removed the charset from the Content-type header, the solution they seem to have chosen, because they are sending gzipped content. Now they are at risk of having Moz get part of the content, unzip it, discover that it should be in a charset other than the one it expected from HTTP's default, and get into an argument with Apache over what the proper offset is for a re-request.

Posted by Phil Ringnalda at

While struggling to get a translated major web site running, I found this tutorial to be very, very helpful.

[link]

Dealing with any J2EE implementations and character encoding is an exercise in patience. For anyone struggling with that, I found that (logically when you think about it) any dynamic JSP includes must have the content-type set in the included JSP as they are compiled and executed independently from the JSP calling them. At least in JSP 1.1, which is what we were using.

As far as browsers go, just remember - the HTTP headers trump all. If you've set those, it doesn't matter what else you do. If you haven't set those, then you can concern yourself with the meta-tags.

Personally, I suggest going the HTTP header route. It's at least consistent and explicit.

Posted by Andrew Robinson at

Building character

I've been working on mod_blog for the past few days. Bob, who runs the Friday D&D game, has a site that used to use Blogger but due to reliability problems (as well as certain security issues related to FTP) I switched them over to mod_blog....

Excerpt from The Boston Diaries at

Cultural Sensitivity in Technology

The ‘Coltrane’ release of Lotus Freelance Graphics, the 1998 version bundled with Lotus SmartSuite, was famously held up for a month when it was found out that one of the clip art images had a tiny 20-pixel image of Taiwanese currency (rather than...

Excerpt from Koranteng's Toli at

can i get copy and paste work
if i then mail me at raiseinheaven@rediffmail.com

Posted by yuvraj at

Unicode and Ruby

I gave a presentation called I18n, M17n, Unicode, And All That at the recent 2006 RubyConf in Denver. This piece doesn’t duplicate this presentation; it outlines the problem, some conference conversation, and includes a couple of images that you...

Excerpt from ongoing at

Answer by Lichtamberg for Change the charset attribute in the html header

Maybe you should try to change your webserver's encodings? E.g. adding another charset? See [link] or maybe this: [link]...

Excerpt from Change the charset attribute in the html header - Stack Overflow at

Core design principles

I intend Foliomatic primarily for personal web sites such as this one. It should also be useful for projects and organizations whose Web presence is mostly static content, updated from time to time. It is not going to be a … Continue reading →...

Excerpt from Owl's Portfolio at
