PHP and Unicode.

2004-10-01T19:45:55Z

Jarek Zgoda: It still doesn't have native unicode support, so all this XML buzz is just that -- a buzz. In modern world lacking of unicode awareness makes any solution incomplete.

I agree with Adam Trachtenberg: Unicode support is on my list of things that would be great to add to PHP 6.

Sterling Hughes and Thies Arntzen point out that Parrot is fully Unicode, but that is largely due to its use of the ICU libraries, and it addresses only a small part of the problem. The hard part is all of the inputs, outputs, and extensions.

My recommendation would be to first upgrade the current code base to use UTF-8 internally, and use that to shake out all of the interface problems. UTF-8 has a number of desirable features:

All plain ASCII strings are also valid UTF-8 strings.
Functions like strpos can continue to operate in a byte-oriented manner.
UTF-8 strings can be fairly reliably recognized as such by a simple algorithm: the probability that a string of characters in any other encoding appears as valid UTF-8 is low, and it diminishes with increasing string length.
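These three properties are easy to check by hand. The following is a minimal sketch in Python rather than PHP, chosen only for illustration; `is_valid_utf8` is a hypothetical helper name, and `bytes.find` stands in for a byte-oriented `strpos`.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Property 1: plain ASCII is already valid UTF-8, byte for byte.
ascii_ok = is_valid_utf8(b"plain ASCII")

# Property 2: a byte-oriented search still finds whole characters,
# because no character's encoding can appear inside another's.
haystack = "naïve café".encode("utf-8")
needle = "é".encode("utf-8")
byte_offset = haystack.find(needle)  # a byte offset, not a character index

# Property 3: the same text in windows-1252 fails UTF-8 validation,
# which is what makes the encoding cheap to recognize.
legacy = "naïve café".encode("cp1252")
legacy_ok = is_valid_utf8(legacy)
```

Note that the search returns a byte offset; turning that into a character index is a separate step.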

Overall, I would suggest the following:

On all inputs, try utf-8 first. If the byte sequence is not valid utf-8, go for win-1252. (Yes, iso-8859-1 is the "right" answer, but read here and here as to why win-1252 should be picked anyway).
Update internal routines to ignore bytes in the range 0x80 through 0xBF (the UTF-8 continuation bytes) when computing character indexes in strings.
When the output encoding is not declared, output all non-ASCII characters as Numeric Character References, which work equally well in HTML and XML.
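The three suggestions above can be sketched roughly as follows. This is an illustration in Python, not a proposal for the actual PHP internals; the function names (`decode_input`, `char_length`, `char_index`, `to_ncr`) are made up for the example, and "character" here means a Unicode code point.

```python
def decode_input(data: bytes) -> str:
    """Try UTF-8 first; if the bytes are not valid UTF-8, fall back to windows-1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes undefined (0x81, for
        # example), so replace those instead of failing a second time.
        return data.decode("cp1252", errors="replace")


def char_length(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes 0x80..0xBF."""
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)


def char_index(data: bytes, byte_offset: int) -> int:
    """Convert a byte offset (as a byte-oriented search returns) to a character index."""
    return char_length(data[:byte_offset])


def to_ncr(text: str) -> bytes:
    """Emit pure ASCII, writing every non-ASCII character as a decimal NCR."""
    return text.encode("ascii", errors="xmlcharrefreplace")
```

For instance, `decode_input(b"caf\xe9")` falls through to the windows-1252 branch because a lone 0xE9 is not valid UTF-8, and `to_ncr` then writes the é as `&#233;`, which an HTML or XML parser reads the same way whatever encoding it assumes.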

Determining the output encoding by parsing the Content-Type header would be a good idea, as would providing functions that explicitly set the default input and output encodings.
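As a rough sketch of how those two ideas might fit together, again in Python for illustration: `set_default_output_encoding`, `charset_from_content_type`, and `encode_output` are hypothetical names, not existing PHP functions. The header parsing leans on the standard library's `email.message.Message`, and anything the chosen encoding cannot represent falls back to a Numeric Character Reference.

```python
from email.message import Message

_default_output_encoding = None  # None means "not declared"


def set_default_output_encoding(name):
    """Explicitly set the encoding used when the header does not name one."""
    global _default_output_encoding
    _default_output_encoding = name


def charset_from_content_type(header_value: str):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def encode_output(text: str, content_type: str = "") -> bytes:
    """Pick the header charset, then the explicit default, then ASCII with NCRs."""
    charset = charset_from_content_type(content_type) if content_type else None
    charset = charset or _default_output_encoding or "ascii"
    # Characters the chosen encoding cannot represent become NCRs.
    return text.encode(charset, errors="xmlcharrefreplace")
```

With no charset declared anywhere, `encode_output("café", "text/html")` produces `caf&#233;`; with `charset=utf-8` in the header, the same call produces the UTF-8 bytes directly.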

Once the bugs are shaken out, upgrading to ICU (or converting to something like Parrot) would be a considerably simpler proposition.