Dragons be gone
Jacques Distler: if anyone tells you: “i18n is easy, just use utf-8!” … go ahead and smack them.
Luckily, I’m outside of arm’s reach. You see, my weblog is 100% valid XHTML 1.1, encoded as utf-8.
Truth be told, however, it would also be 100% valid XHTML 1.1 if it were declared as iso-8859-1 (roman), iso-8859-5 (cyrillic), win-1252 (Microsoft), or macroman (Apple).
In fact, it would also be 100% valid if it were declared as encoded as us-ascii.
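A quick way to see why the declared encoding stops mattering: if every byte on the page is in the us-ascii range, each of the encodings above decodes it to the same text. The following is a minimal Python sketch of that check, not code from this weblog; the function name `decodes_identically` and the sample page are made up for illustration, and `cp1252` and `mac_roman` are Python's codec names for win-1252 and macroman.

```python
# Encodings named in the post, using Python's codec names.
ENCODINGS = ["utf-8", "iso-8859-1", "iso-8859-5", "cp1252", "mac_roman", "ascii"]


def decodes_identically(data: bytes, encodings=ENCODINGS) -> bool:
    """Return True if every listed encoding decodes `data` to the same text."""
    texts = set()
    for name in encodings:
        try:
            texts.add(data.decode(name))
        except UnicodeDecodeError:
            # At least one encoding cannot represent these bytes at all.
            return False
    return len(texts) == 1


# Hypothetical page fragment: the accented character is a numeric entity,
# so the bytes themselves are pure us-ascii.
ascii_page = b"<p>caf&#233;</p>"

# The same fragment with the character written as raw utf-8 bytes.
utf8_page = "<p>caf\u00e9</p>".encode("utf-8")

print(decodes_identically(ascii_page))
print(decodes_identically(utf8_page))
```

The first fragment reads the same under every label; the second does not, which is where a mismatched link in the chain starts to hurt.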
So, I’m not actually disagreeing with Jacques, and therefore could probably avoid a smack. I would, however, quibble with his first line:
i18n is hard. Don’t let anyone tell you any different.
Certain versions of certain tools don’t handle utf-8 well: that much I agree with. I would also add that finding the right combination for a given configuration and keeping it working through upgrades is a bit of an effort.
But i18n != utf-8. You can i18n with us-ascii just fine. Just use numeric entities. In my Python implementation, the logic to do this is a bit spread out, but I have built a more compact Ruby version.
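For the curious, the core of the idea fits in one line of Python. This is only a sketch of the technique, not the spread-out implementation mentioned above and not the Ruby version; the helper name `to_ascii_with_entities` is invented here. It leans on the standard `xmlcharrefreplace` error handler, which swaps any character that us-ascii cannot represent for a decimal numeric entity.

```python
def to_ascii_with_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric entity.

    Characters that are already ASCII (including markup such as < and &)
    pass through untouched, so this is meant for text that has already
    been escaped as XML/XHTML.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_with_entities("caf\u00e9"))  # caf&#233;
```

The result is plain us-ascii, so it survives any tool in the chain that can handle us-ascii, which is pretty much all of them.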
Moral of the story: don’t convert to utf-8 in a plugin unless you are certain that every link in the chain can handle utf-8 properly. If you feel that you must convert to utf-8, then it is best to do it as some sort of post-processing filter after all the other logic has taken place.
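If you do go the post-processing route, such a filter might look roughly like the Python sketch below. It is an illustration under assumptions, not something any particular weblog tool provides: the function name `entities_to_utf8` is hypothetical, and it assumes the rest of the pipeline has already produced finished us-ascii markup with numeric entities. Entities for markup-significant characters are deliberately left alone so the output stays well-formed.

```python
import re

# Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric entities.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

# Expanding these would change the structure of the markup, so keep them escaped.
_MARKUP_CHARS = {"<", ">", "&", '"', "'"}


def entities_to_utf8(markup: str) -> bytes:
    """Final-stage filter: expand numeric entities, then encode as utf-8."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave out-of-range values and surrogates as they were written.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        char = chr(code_point)
        if char in _MARKUP_CHARS:
            return match.group(0)
        return char

    return _NUMERIC_ENTITY.sub(replace, markup).encode("utf-8")


print(entities_to_utf8("<p>caf&#233; &#60;3</p>"))
```

Because this runs last, everything upstream only ever sees us-ascii, and only the final bytes on the wire are utf-8.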
But, for me, numeric entities are just fine.