Sometimes the dragon wins
Scott Johnson: ɥɦɐ I just had to try out some funky characters to see what would happen. :)
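An advantage of declaring this page as utf-8 is that I can distinguish between somebody typing ɥɦɐ and somebody typing the numeric entities &amp;#613;&amp;#614;&amp;#592; (the decimal spelling of the same three characters), meaning that people don’t have to double escape if they want to talk about numeric entities on my weblog.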
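To make the distinction concrete, here is a minimal Python sketch. It is not the weblog's actual code; the variable names are hypothetical, and it assumes the entities are written in decimal form. It shows that the literal characters and their entity spelling are different strings until something decodes the entities, and that escaping on output keeps the entity spelling visible to readers.

```python
import html

literal = "ɥɦɐ"                   # characters typed directly by a commenter
entities = "&#613;&#614;&#592;"   # decimal numeric entities for the same code points

# Build the entity spelling from the code points of the literal characters.
built = "".join(f"&#{ord(ch)};" for ch in literal)

# As stored text, the two spellings are distinct strings.
same_as_typed = (literal == entities)

# Escaping on output keeps the entity spelling visible in the rendered page.
escaped = html.escape(entities)

# Decoding the entities collapses them to the literal characters.
decoded = html.unescape(entities)

print(built, same_as_typed, escaped, decoded)
```

Because the stored text is kept as typed and escaped only on output, a commenter can discuss an entity without escaping it a second time.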
But don’t try to search for ɥɦɐ. While such a query will be properly URI encoded based on utf-8, that particular string does not appear in any text files.
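As a rough illustration of what "URI encoded based on utf-8" means for that query, the following Python sketch percent-encodes the string with the standard library. It is not how this weblog builds its search URLs; it only demonstrates the encoding step, plus the fact that a single-byte encoding such as Latin-1 cannot represent these characters at all.

```python
from urllib.parse import quote, unquote

query = "ɥɦɐ"

# Each character becomes two UTF-8 bytes, and each byte is percent-encoded.
encoded = quote(query, encoding="utf-8")
print(encoded)  # %C9%A5%C9%A6%C9%90

# Decoding with the same encoding round-trips to the original string.
round_trip = unquote(encoded, encoding="utf-8")

# Latin-1 has no code points for these characters, so encoding fails.
try:
    quote(query, encoding="latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(round_trip, latin1_ok)
```

The encoding half works regardless of the backend; whether anything matches depends on what the search actually scans.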
So, sometimes the dragon wins. If you have a requirement for full text search, and you haven’t outsourced it to Google, then you need a database that understands encodings, and all of Julik’s points apply.
Before I deploy my Ruby-based weblog, I want to make sure that both FastCGI and a database that supports utf-8 are in place (Cornerhost is currently running MySQL 3.23.58).
Some footnotes:
- Beyond Java? At the present time, Java is better than Python, PHP, Perl, and Ruby in handling Unicode.
- Taking charge of your own destiny? Sure, I have access to the full source to MySQL, but you think I’m going to hack Unicode support in there? Heck no, there be dragons in there! It’s cheaper to switch databases (or, in this case, upgrade to a new version).
- Actually innovative? I believe more strongly than ever that internationalization is an excellent litmus test as to whether or not that flashy startup has an expensive rewrite in its future. I realize that some people disdain edge cases, but what makes this an art more than a science is knowing which edge cases are important and which can be YAGNI'd away, coupled with the spirit of "purports to conform".