utf-8 musings

2004-04-25T16:30:46Z

Jacques Distler: If I converted to UTF-8, presumably, this problem would be solved. Unfortunately, the last time I tried it, the interaction between UTF-8 and MT’s Comment Form was such a horror story that I’m loath to try it again.

Consider the following observations about Musings:

Conceptually, each web page is composed of characters from the ISO 10646 character set.
Physically, each web page is a stream of bytes that is nearly always limited to US-ASCII. In fact, despite a reference to Erdös and the presence of a number of Hebrew characters, this web page was exclusively US-ASCII until I left this comment.
By declaration, each web page is iso-8859-1.

Now let's take a look at utf-8 in Perl:

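 use strict;
 use warnings;
 use Encode;
 
 # utf-8 bytes, written as hex
 my $input = "49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E";
 print "$input\n";
 
 # Unpack the hex into bytes, then decode the bytes into a character string
 $input = decode('utf-8', pack("H*", $input));
 
 # Print the ordinal value of each character, in hex
 for (my $i = 0; $i < length($input); $i++) {
   printf "%X", ord(substr($input, $i, 1));
 }
 print "\n";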

Which will produce the following output:

 49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E
 49F174EB726EE27469F46EE06C697AE67469F86E

Astute readers may note that the second line corresponds to iso-8859-1. Confused yet?
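
One way to check that claim, as a rough sketch: encode the decoded string as iso-8859-1 and compare the resulting bytes with the ordinal values printed by the loop above. The variable names here are mine, and the comparison only lines up this neatly because every character in the sample has an ordinal value between 0x10 and 0xFF.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_hex = '49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E';
my $text     = decode('utf-8', pack('H*', $utf8_hex));

# Hex of each character's ordinal value (what the loop above prints).
my $ordinals = join '', map { sprintf('%X', ord($_)) } split //, $text;

# Hex of the bytes produced by encoding the same string as iso-8859-1.
my $latin1 = uc(unpack('H*', encode('iso-8859-1', $text)));

print "$ordinals\n$latin1\n";
print $ordinals eq $latin1 ? "identical\n" : "different\n";
```
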

Now let's make a few observations which will simplify things:

The characters in US-ASCII are a proper subset of the characters in iso-8859-1, which in turn are a proper subset of the characters in iso 10646.
The byte representation for the first 128 characters in the iso 10646 character set is the same in US-ASCII, iso-8859-1, and utf-8.
In US-ASCII and iso-8859-1, the ordinal value of each character is equal to the numeric value of the corresponding byte. In utf-8, bytes with a numeric value greater than 127 are part of a multibyte sequence.
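
These observations can be checked with a few lines of Perl. This is only a sketch: `hex_bytes` is a helper name I made up for the illustration, and the sample strings are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the uppercase hex bytes of $text when encoded as $encoding.
sub hex_bytes {
    my ($text, $encoding) = @_;
    return uc(unpack('H*', encode($encoding, $text)));
}

my $ascii_only = "Musings";
my $n_tilde    = "\x{F1}";    # U+00F1, LATIN SMALL LETTER N WITH TILDE

# Characters below 128 produce the same bytes in all three encodings.
for my $encoding (qw(US-ASCII iso-8859-1 utf-8)) {
    printf "%-10s %s\n", $encoding, hex_bytes($ascii_only, $encoding);
}

# In iso-8859-1 the byte value equals the ordinal value;
# in utf-8 the same character becomes a two-byte sequence.
printf "%-10s %s\n", 'iso-8859-1', hex_bytes($n_tilde, 'iso-8859-1');
printf "%-10s %s\n", 'utf-8',      hex_bytes($n_tilde, 'utf-8');
```
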

Now let's try a Gedanken Experiment... imagine an alternate version of Musings in which every character returned in response to an HTTP GET request was precisely limited to US-ASCII. That would mean that independent of how my comment was stored, it would be transmitted as the following (or equivalent):

I&ntilde;t&euml;rn&acirc;ti&ocirc;n&agrave;liz&aelig;ti&oslash;n

Such web pages could be validly declared as us-ascii, iso-8859-1 or utf-8. In fact, the only operational difference would be how data returned from forms are encoded.
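
As a quick sanity check of that claim, the sketch below decodes the same US-ASCII-only bytes under each of the three declarations and compares the results. The variable names are mine, and the sample is just the entity-encoded string from above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The page content as it would travel over the wire: US-ASCII bytes only.
my $page_bytes = 'I&ntilde;t&euml;rn&acirc;ti&ocirc;n&agrave;liz&aelig;ti&oslash;n';

# Decode the same bytes under each candidate charset declaration.
my %decoded = map { $_ => decode($_, $page_bytes) } qw(US-ASCII iso-8859-1 utf-8);

my $same = $decoded{'US-ASCII'} eq $decoded{'iso-8859-1'}
        && $decoded{'iso-8859-1'} eq $decoded{'utf-8'};

print $same
    ? "same characters under all three declarations\n"
    : "declarations disagree\n";
```
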

The comment forms on Musings have four input fields and one textarea. Encode::decode can be used to convert the utf-8 bytes received into a Perl string. The Perl length, ord, and substr functions work on characters (as of Perl version 5.6). In fact, a loop like the above can be used to detect which characters have an ordinal value greater than 127, and replace such characters with a numeric character reference.
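
A minimal sketch of such a loop follows. The name `ncr_escape` is hypothetical, it assumes the form data really does arrive as valid utf-8, and it makes no attempt to escape markup characters such as `&` or `<`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw utf-8 bytes (as received from a form field)
# and returns a US-ASCII string in which every character with an ordinal
# value above 127 is replaced by a decimal numeric character reference.
sub ncr_escape {
    my ($bytes) = @_;
    my $text = decode('utf-8', $bytes);    # bytes -> Perl character string
    my $out  = '';
    for my $i (0 .. length($text) - 1) {
        my $char = substr($text, $i, 1);
        my $ord  = ord($char);
        $out .= $ord > 127 ? sprintf('&#%d;', $ord) : $char;
    }
    return $out;
}

my $bytes = pack('H*', '49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E');
print ncr_escape($bytes), "\n";
```

For the sample bytes used earlier, this prints `I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n`, which is safe to serve under any of the three declarations.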