http://www.intertwingly.net/blog/index.atom Sam Ruby It’s just data Sam Ruby rubys@intertwingly.net http://www.intertwingly.net/blog/ 2005-11-05T04:43:50-05:00 tag:intertwingly.net,2004:2100 Sometimes the dragon wins

Scott Johnson: ɥɦɐ I just had to try out some funky characters to see what would happen.  :)

An advantage of declaring this page as utf-8 is that I can distinguish between somebody typing ɥɦɐ and somebody typing &#613;&#614;&#592;, meaning that people don’t have to double escape if they want to talk about numeric entities on my weblog.

But don’t try to search for ɥɦɐ.  While such a query will be properly URI encoded based on utf-8, that particular string does not appear in any text files.

So, sometimes the dragon wins.  If you have a requirement for full text search, and you haven’t outsourced it to google, then you need a database that understands encodings, and all of Julik’s points apply.
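
A minimal sketch of the mismatch, in Python rather than anything this weblog actually runs. The stored text and the variable names are assumptions for illustration. It shows only that a UTF-8 query, once percent-decoded, is not the same string as numeric character references sitting in a text file.

```python
from urllib.parse import quote, unquote

query = "\u0265\u0266\u0250"  # the three characters ɥɦɐ

# A browser percent-encodes the UTF-8 bytes of the query.
encoded = quote(query, safe="")

# Assumption: the text file holds numeric character references, not raw UTF-8.
stored_text = "funky characters: &#613;&#614;&#592;"

decoded = unquote(encoded)      # back to the original three characters
found = decoded in stored_text  # naive substring search over the file

print(encoded, found)
```

The query arrives intact, but the search still comes up empty because the two representations never meet.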

Before I deploy my Ruby-based weblog, I want to make sure that both FastCGI and a database that supports utf-8 are in place (Cornerhost is currently running MySQL 3.23.58).

Some footnotes:

  • Beyond Java?  At the present time, Java is better than Python, PHP, Perl, and Ruby in handling Unicode.
  • Taking charge of your own destiny?  Sure, I have access to the full source to MySQL, but you think I’m going to hack Unicode support in there?  Heck no, there be dragons in there!  It’s cheaper to switch databases (or, in this case, upgrade to a new version).
  • Actually innovative?  I believe more strongly than ever that internationalization is an excellent litmus test as to whether or not that flashy startup has an expensive rewrite in their future.  I realize that some people disdain edge cases, but what makes this an art more than a science is knowing which edge cases are important and which can be YAGNI’d away, coupled with the spirit of “purports to conform”.
2005-11-04T08:08:47-05:00
tag:intertwingly.net,2004:2100-1131134721 http://www.sencer.de dslb-084-063-021-172.pools.arcor-ip.net form Sencer http://www.sencer.de Sencer Sometimes the dragon wins

<pedantic>
I sure hope you have mysql 3.23.58, not mysql 3.2.58...
</pedantic>

And to actually say something useful: the next version of PHP will have full Unicode support (it’s already in CVS), but I am not sure when that version will be stable. And then there is of course the issue of when you’ll be able to get it at your shared-hosting provider. I mean, MySQL 4.1 is relatively old, but look at how many (or few) places offer you MySQL 4.1 or up. It also looks like PHP 4 is still way ahead of PHP 5.x in terms of availability.

2005-11-04T10:05:21-05:00
tag:intertwingly.net,2004:2100-1131136033 http://www.intertwingly.net/blog/ cpe-066-057-027-065.nc.res.rr.com form Sam Ruby http://www.intertwingly.net/blog/ Sam Ruby Sometimes the dragon wins

I sure hope you have mysql 3.23.58, not mysql 3.2.58...

Fixed, thanks!

PHP 6’s support for Unicode looks very nice.

2005-11-04T10:27:13-05:00
tag:intertwingly.net,2004:2100-1131142439 http://blogamundo.net/dev 201-27-7-69.dsl.telesp.net.br form Jonas Galvez http://blogamundo.net/dev Jonas Galvez Sometimes the dragon wins

Sam, I’m currently working on a project that relies heavily on UTF-8 (imagine having to search and display massive amounts of data in every language), and we have managed to get it working properly with MySQL (4.1+) and Rails. There are some limitations, of course, but it’s not as bad as we thought it would be: it seems MySQL does actually enable full-text search with Unicode when you use MyISAM, and Ruby’s String#each_char is UTF-8 aware if $KCODE = 'u'.

Documentation is here, here and here.

2005-11-04T12:13:59-05:00
tag:intertwingly.net,2004:2100-1131142803 84.5.183.26 form Thomas Broyer Thomas Broyer Sometimes the dragon wins
As you’re guaranteed to have only numeric character references (NCRs) in your files, can’t you encode the search query from UTF-8 into us-ascii+NCR before actually searching?
2005-11-04T12:20:03-05:00
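
Thomas’s suggestion can be sketched in a few lines of Python. This is an illustration, not what the weblog runs: the function name `to_ascii_ncr` is hypothetical, and the sketch assumes the files use decimal references without leading zeros. Hexadecimal references would need a different conversion.

```python
def to_ascii_ncr(query: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    # The standard 'xmlcharrefreplace' error handler emits &#NNN; for each
    # character that does not fit in the target encoding.
    return query.encode("ascii", "xmlcharrefreplace").decode("ascii")


ncr_query = to_ascii_ncr("\u0265\u0266\u0250")  # the characters ɥɦɐ
print(ncr_query)
```

Plain ASCII queries pass through unchanged, so the conversion could be applied to every query.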
tag:intertwingly.net,2004:2100-1131148709 http://www.intertwingly.net/blog/ cpe-066-057-027-065.nc.res.rr.com form Sam Ruby http://www.intertwingly.net/blog/ Sam Ruby Sometimes the dragon wins

Thomas: what I have is a word search, meaning that you won’t find “ear” in “search”.

This is implemented using swish-e, which doesn’t handle utf-8 very well.

2005-11-04T13:58:29-05:00
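
A rough Python sketch of the difference between word search and substring search. The regex tokenizer here is a stand-in chosen for illustration and is not swish-e’s actual tokenizer. How swish-e treats the characters in a numeric character reference depends on its configuration, so the last step is an assumption about a generic word indexer.

```python
import re


def words(text: str) -> set:
    """Split text into lowercase word tokens (a stand-in for a real indexer)."""
    return set(re.findall(r"\w+", text.lower()))


index = words("full text search")

substring_hit = "ear" in "full text search"  # substring match succeeds
word_hit = "ear" in index                    # word match fails: no such word

# To this tokenizer an NCR-encoded query is punctuation plus digits,
# so it does not survive as a single word.
ncr_tokens = words("&#613;&#614;&#592;")

print(substring_hit, word_hit, sorted(ncr_tokens))
```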
tag:intertwingly.net,2004:2100-1131149729 http://keithdevens.com/weblog/archive/2005/Nov/04/Ruby.dragon-wins excerpt Keith's Weblog Sam Ruby: Sometimes the dragon wins
Sam Ruby: Sometimes the dragon wins. Not here: ɥɦɐ. And you can search for it. My dirty little secret, however, is that I’m storing everything in MySQL in fields declared to be encoded in latin1, storing UTF-8 in there anyway, and trusting the...
2005-11-04T14:15:29-05:00
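
The excerpt above is cut off, so what exactly is being trusted is not stated. The following Python sketch only illustrates what “UTF-8 stored in a latin1 field” means at the byte level. Python’s `latin-1` codec is used as a stand-in for the database’s notion of latin1, which is an assumption.

```python
text = "\u0265\u0266\u0250"        # ɥɦɐ, three characters
utf8_bytes = text.encode("utf-8")  # six bytes

# What a layer that believes the column is latin1 sees: one character per byte.
as_latin1 = utf8_bytes.decode("latin-1")

# The bytes survive the round trip as long as nothing transcodes them on the way.
round_trip = as_latin1.encode("latin-1").decode("utf-8")

print(len(text), len(as_latin1), round_trip == text)
```

The storage layer counts six characters where the author typed three, which matters for things like column lengths, sorting, and case folding.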
tag:intertwingly.net,2004:2100-1131149731 http://del.icio.us/url/8a2ae3b253e9eb7cf84024bd1c0fe589 excerpt del.icio.us/tag/ruby Sam Ruby: Sometimes the dragon wins
Check the comments....
2005-11-04T14:15:31-05:00
tag:intertwingly.net,2004:2100-1131198314 http://saladwithsteve.com/2005/11/sam-ruby-sometimes-dragon-wins.html excerpt saladwithsteve Sam Ruby: Sometimes the dragon wins
Sam Ruby: Sometimes the dragon wins: “I believe more strongly than ever that internationalization is an excellent litmus test as to whether or not that flashy startup has an expensive rewrite in their future.” Also in that post: "At...
2005-11-05T03:45:14-05:00
tag:intertwingly.net,2004:2100-1131201643 http://www.dedasys.com/davidw/ host80-158.pool8258.interbusiness.it form David N. Welton http://www.dedasys.com/davidw/ David N. Welton Sometimes the dragon wins
Tcl has had good Unicode support for several years.  It slows things down a little bit if all you ever need to handle is plain ASCII, but that’s because it’s thoroughly integrated into the core of the language.  In my experience, Tcl handles most i18n tasks with aplomb.
2005-11-05T04:40:43-05:00
tag:intertwingly.net,2004:2100-1131201829 http://naeblis.cx/rtomayko/ 69-168-180-186.clvdoh.adelphia.net form Ryan Tomayko http://naeblis.cx/rtomayko/ Ryan Tomayko Sometimes the dragon wins


I realize that some people disdain edge cases, but what makes this an art more than a science is knowing which edge cases are important and which can be YAGNI’d away.

Yep. I’d question whether a basic understanding of, and conformance to, baseline character encoding conventions is an edge case at all. It fits solidly into the 80 of the 80/20 for a large portion of the world’s population.

2005-11-05T04:43:49-05:00