It’s just data

Rails and Snowmen

People have started to notice that Rails is adding a snowman to their URLs.  There is now even a website devoted to this.

These types of social implications of technical decisions fascinate me.  Here’s some further background that I have pieced together.  I may have some details wrong, corrections welcome.

For starters, Rails by default standardizes on utf-8 for web pages.  As with pretty much everything in Rails, you can change the default, but virtually nobody does.  Utf-8 is a good choice here, and certainly is better than iso-8859-1 or win-1252.

Rails provides the encoding information on the Content-Type header, and on the accept-charset attribute of forms.  Under normal circumstances, this will cause all form data sent back to the server to be encoded as utf-8, across all commonly used browsers.  Yes, including IE.

Most pages in Rails are produced using templates, and generally these templates are not the problem.  Data in those templates typically comes from databases, and sometimes data that isn’t 100% pure and clean can get into databases.  In particular, sometimes this data may have encoding errors.  Such errors can easily become visible when that data is displayed in a form.

Browser recovery strategies vary on encoding errors, but often involve displaying a diamond with a question mark in it.
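The diamond with a question mark is U+FFFD, the Unicode replacement character.  As a rough sketch of how it comes about (the sample word is made up for illustration), here is what a lenient utf-8 decoder does with bytes that were actually written as iso-8859-1:

```python
# Text stored as iso-8859-1 bytes, but later served as if it were utf-8.
stored = "café".encode("iso-8859-1")   # the é becomes the single byte 0xE9

# A lenient decoder substitutes U+FFFD for the byte it cannot interpret.
shown = stored.decode("utf-8", errors="replace")

print(repr(shown))
```

The last character of `shown` is the replacement character rather than the original é, which is roughly what the user ends up looking at.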

User behavior varies in the presence of such errors, but a common reaction is to switch the encoding.

The trouble starts when the user then proceeds to submit the form.  The net result, with some browsers, is that the data is sent respecting the user’s choice.  In other cases, browsers send the data using the application’s choice.

How Rails will react to an encoding other than utf-8 being used depends on the version of Rails, the version of Ruby and a number of other factors.  In some cases, the result is an HTTP 400 response code (Bad Request).  In others, a 500 (Internal Server Error).  In others, a 404 (Not Found).  In others, even more misencoded data will make it all the way to the database.
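To make the failure concrete, here is a minimal sketch in Python rather than Rails.  The helper `decode_form_value` is hypothetical and simply insists on utf-8; it is not how Rails itself handles the request.  It shows the same word arriving percent-encoded in two different charsets:

```python
from urllib.parse import quote, unquote_to_bytes

def decode_form_value(percent_encoded: str) -> str:
    """Hypothetical helper: decode one form value, insisting on utf-8."""
    raw = unquote_to_bytes(percent_encoded)
    return raw.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes

# The same word as a browser might send it under two different encodings.
as_utf8 = quote("café", encoding="utf-8")
as_latin1 = quote("café", encoding="iso-8859-1")

print(as_utf8, "->", decode_form_value(as_utf8))

try:
    decode_form_value(as_latin1)
except UnicodeDecodeError as err:
    # An application might turn this into a 400, or let it surface as a 500.
    print(as_latin1, "-> cannot decode:", err.reason)
```

An application that skips this kind of check is the one that ends up storing the misencoded bytes.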

As I said, sometimes the browser will choose to respect the user’s choice.  This is generally only done if it is possible to do so.  As not every character can be encoded using Western ISO Latin1, including such a character in a hidden field has been found to be an effective strategy for forcing the browser’s hand.

Enter the snowman.

In most cases, this is simply invisible metadata that solves a real problem that is otherwise hard to describe and debug.
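The property that makes the snowman (U+2603) useful here is easy to check.  The sketch below, with a hypothetical helper `can_encode`, tests whether the character can be represented in a few charsets:

```python
SNOWMAN = "\u2603"  # the snowman character

def can_encode(text: str, charset: str) -> bool:
    """Hypothetical helper: True if every character in text fits in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

results = {cs: can_encode(SNOWMAN, cs)
           for cs in ("utf-8", "iso-8859-1", "cp1252")}
print(results)
```

Only utf-8 can represent it, so a browser that has been switched to one of the Western encodings cannot send the field faithfully in that encoding.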

Unfortunately, it isn’t always so invisible.  Try a query on this page and observe the resulting URI.  This page opted to use HTTP GET in order to make the URI meaningful.  Unfortunately the URIs with the latest version of Rails now have a bit of exposed cruft.
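For a sense of what the cruft looks like, here is a small sketch that builds such a query string.  The field name `_snowman` and the search parameter `q` are assumptions made for illustration; neither is stated above.

```python
from urllib.parse import urlencode, parse_qs

# Assumed names: a hidden "_snowman" field plus one visible search field "q".
params = {"_snowman": "\u2603", "q": "rails"}

query = urlencode(params)   # str values are percent-encoded as utf-8
print(query)

# Decoding the query recovers the original values.
decoded = parse_qs(query)
print(decoded)
```

The three percent-encoded bytes at the front of the query are the part that readers of the URI were never meant to see.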

The fact that people care enough about such things to complain indicates that socialization of the concept that URIs are to be meaningful is working.  The unfair perception that this is (yet another) workaround for IE has also entered into the debate.

This is a very real problem.  One without clean and comprehensive solutions.  The Rails team is aware of the _charset_ hidden value, but that opens up a different set of problems.

Solutions being discussed to date include renaming the form field, choosing a different character, moving the field to the end of the query, and providing a mechanism to opt out.