intertwingly

It’s just data

utf-8 musings


Jacques Distler: If I converted to UTF-8, presumably, this problem would be solved. Unfortunately, the last time I tried it, the interaction between UTF-8 and MT’s Comment Form was such a horror story that I’m loath to try it again.

Consider the following observations about Musings:

Now let's take a look at utf-8 in Perl:

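 use strict;
 use warnings;
 use Encode;
 
 # Hex dump of the utf-8 bytes for "Iñtërnâtiônàlizætiøn"
 my $input="49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E";
 print "$input\n";
 
 # Turn the hex into raw bytes, then decode those bytes into a Perl character string
 $input=decode('utf-8',pack("H*",$input));
 
 # Print the ordinal value of each character in hex
 for (my $i=0; $i<length($input); $i++) {
   printf "%X", ord(substr($input,$i,1));
 }
 print "\n";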

Which will produce the following output:

 49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E
 49F174EB726EE27469F46EE06C697AE67469F86E

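Astute readers may note that the second line corresponds to iso-8859-1: every character in this string has an ordinal value below 256, so the hex of the ordinals is identical to the hex of the iso-8859-1 bytes.  Confused yet?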
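To make that correspondence concrete, here is a minimal sketch (not part of the original script) that decodes the first few characters of the same string and compares two things: the ordinals of the characters, and the bytes you get by re-encoding the string as iso-8859-1. The variable names are mine, and it assumes the Encode module that ships with Perl 5.8 and later.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# utf-8 bytes for the first six characters, "Iñtërn"
my $bytes = pack("H*", "49C3B174C3AB726E");
my $chars = decode('utf-8', $bytes);

# Hex of each character's ordinal value
my $codepoints = join '', map { sprintf("%X", ord($_)) } split(//, $chars);

# Hex of the same string re-encoded as iso-8859-1 (one byte per character)
my $latin1 = uc(unpack("H*", encode('iso-8859-1', $chars)));

print "$codepoints\n";   # 49F174EB726E
print "$latin1\n";       # 49F174EB726E
```

The two lines only match because nothing in this string lies outside the iso-8859-1 range; a character such as the euro sign would break the coincidence.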

Now let's make a few observations which will simplify things:

Now let's try a Gedanken Experiment... imagine an alternate version of Musings in which every character returned in response to an HTTP GET request was strictly limited to US-ASCII.  That would mean that, independent of how my comment was stored, it would be transmitted as the following (or equivalent):

I&ntilde;t&euml;rn&acirc;ti&ocirc;n&agrave;liz&aelig;ti&oslash;n

Such web pages could be validly declared as us-ascii, iso-8859-1 or utf-8.  In fact, the only operational difference would be how data returned from forms are encoded.

The comment forms on Musings have four input fields and one textarea.  Encode::decode can be used to convert the utf-8 bytes received into a Perl string.  The Perl length, ord, and substr functions work on characters (as of Perl version 5.6).  In fact, a loop like the one above can be used to detect which characters have an ordinal value greater than 127, and replace each such character with a numeric character reference.
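That replacement step can be sketched as follows. This is only an illustration under assumptions: ncr_escape is a hypothetical helper name, not anything in MT or Musings, and it assumes the form data arrives as raw utf-8 bytes. It emits decimal numeric character references rather than the named entities shown earlier; browsers treat the two as equivalent.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take utf-8 bytes from a form field and return a
# pure US-ASCII string, with every character above 127 replaced by a
# decimal numeric character reference.
sub ncr_escape {
    my ($bytes) = @_;
    my $chars = decode('utf-8', $bytes);    # bytes -> Perl character string
    my $out = '';
    for my $i (0 .. length($chars) - 1) {
        my $c   = substr($chars, $i, 1);
        my $ord = ord($c);
        $out .= $ord > 127 ? sprintf("&#%d;", $ord) : $c;
    }
    return $out;
}

my $input = pack("H*", "49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E");
print ncr_escape($input), "\n";
# I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n
```

Applied to each of the five fields before the comment is stored or displayed, something like this would keep the transmitted page within US-ASCII regardless of what the commenter typed.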