utf-8 musings
Jacques Distler: If I converted to UTF-8, presumably, this problem would be solved. Unfortunately, the last time I tried it, the interaction between UTF-8 and MT’s Comment Form was such a horror story that I’m loath to try it again.
Consider the following observations about Musings:
- Conceptually, each web page is composed of characters from the ISO 10646 character set.
- Physically, each web page is a stream of bytes which nearly always are limited to US-ASCII. In fact, despite a reference to Erdös and the presence of a number of Hebrew characters, this web page was exclusively US-ASCII until I left this comment.
- By declaration, each web page is iso-8859-1.
Now let's take a look at utf-8 in Perl:
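    use Encode;

    # Hex dump of the utf-8 bytes for the test string
    $input="49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E";
    print "$input\n";

    # Convert the utf-8 bytes into a Perl string of characters
    $input=decode('utf-8',pack("H*",$input));

    # Print the ordinal value of each character in hex
    for ($i=0; $i<length($input); $i++) {
      printf "%X", ord(substr($input,$i,1));
    }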
Which will produce the following output:
    49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E
    49F174EB726EE27469F46EE06C697AE67469F86E
Astute readers may note that the second line corresponds to iso-8859-1. Confused yet?
Now let's make a few observations which will simplify things:
- The characters in US-ASCII are a proper subset of the characters in iso-8859-1, which in turn are a proper subset of the characters in ISO 10646.
- The byte representations of the first 128 characters in the ISO 10646 character set are the same in US-ASCII, iso-8859-1, and utf-8.
- In US-ASCII and iso-8859-1, the ordinal value of each character is equal to the numeric value of the corresponding byte. In utf-8, bytes with a numeric value greater than 127 are part of a multibyte sequence.
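These observations can be checked with a few lines of Perl. What follows is only a sketch: the helper name hex_bytes and the sample strings are made up for illustration, and it assumes the standard Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $chars and return the bytes as uppercase hex.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    return uc(unpack("H*", encode($encoding, $chars)));
}

my $ascii = "Musings";   # every character has an ordinal value below 128
my $latin = "\x{F1}";    # U+00F1, n with tilde

# Characters below 128: the bytes are the same in all three encodings.
print "$_: ", hex_bytes($_, $ascii), "\n"
    for ('us-ascii', 'iso-8859-1', 'utf-8');

# A character above 127: one byte in iso-8859-1, a multibyte sequence in utf-8.
print "$_: ", hex_bytes($_, $latin), "\n"
    for ('iso-8859-1', 'utf-8');
```

The first three lines of output should be identical, while the last two should show F1 for iso-8859-1 and C3B1 for utf-8, which are the same two forms that appear in the output of the earlier script.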
Now let's try a Gedanken Experiment... imagine an alternate version of Musings in which every character returned in response to an HTTP GET request was precisely limited to US-ASCII. That would mean that independent of how my comment was stored, it would be transmitted as the following (or equivalent):
I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n
Such web pages could be validly declared as us-ascii, iso-8859-1 or utf-8. In fact, the only operational difference would be how data returned from forms are encoded.
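A small sketch of why the declaration would not matter for such a page: bytes that are all below 128 decode to the same characters under each of the three encodings. The sample string assumes the comment is written with decimal numeric character references, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A US-ASCII-only rendering of the comment (assumed form, using decimal
# numeric character references).
my $page = 'I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n';

# Decode the same bytes under each declaration a page might carry.
my %decoded = map { $_ => decode($_, $page) }
    ('us-ascii', 'iso-8859-1', 'utf-8');

my $same = $decoded{'us-ascii'} eq $decoded{'iso-8859-1'}
        && $decoded{'iso-8859-1'} eq $decoded{'utf-8'};
print $same ? "identical under all three\n" : "results differ\n";
```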
The comment forms on Musings have four input fields and one textarea. Encode::decode can be used to convert the utf-8 bytes received into a Perl string. The Perl length, ord, and substr functions work on characters (as of Perl version 5.6). In fact, a loop like the above can be used to detect which characters have an ordinal value greater than 127, and replace such characters with a numeric character reference.
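One way such a loop might look is sketched below. The function name to_ncr is hypothetical, and this is not taken from MT or from any existing plugin; it simply assumes the field arrives as raw utf-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take the raw utf-8 bytes of one form field and
# return a US-ASCII string in which every character above 127 has been
# replaced by a decimal numeric character reference.
sub to_ncr {
    my ($bytes) = @_;
    my $chars = decode('utf-8', $bytes);   # bytes -> characters
    my $out = '';
    for (my $i = 0; $i < length($chars); $i++) {
        my $char = substr($chars, $i, 1);
        my $ord  = ord($char);
        $out .= $ord > 127 ? "&#$ord;" : $char;
    }
    return $out;
}

# The same test bytes as in the earlier script.
my $bytes = pack("H*", "49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E");
print to_ncr($bytes), "\n";
```

Applied to each of the five fields, this would leave only US-ASCII to store and serve. Note that the sketch does not escape markup characters such as & or <, and it does nothing special for malformed utf-8 input; both would need separate handling.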