Companion to Atom
Work in progress. By Sam Ruby
Iñtërnâtiônàlizætiøn
Before we can talk about HTTP content types, or about URL escape sequences, or about XML encoding, we need to tackle the topic of what exactly a character is.
Now admit it. Despite the fact that you know that there are some excellent introductions to Unicode out there, you really don't want to learn Unicode, do you? What you really want is for someone to tell you the minimum you need to do to keep your data from being garbled in transit or rejected outright upon receipt. So what I am going to provide here is a survival guide of sorts. What’s a survival guide, you ask? Well, before you travel to a foreign country that speaks a language you don't know, it always helps to learn a few key phrases like “parlez vous anglais?” or “una cerveza por favor”.
OK, so we’ve established that you’ve got a tool that you want to ensure is internationalized. The first thing I want you to do is to copy the string
Iñtërnâtiônàlizætiøn
into your tool and observe what comes out the other side. If you have a weblogging tool, put it in the title, summary, content, and any other nook and cranny you can find. Comments too, if you have got them. Check every output that can be produced, including html and feeds.
Here are some common examples of garbled output:
cp437      I¤t‰rnƒti“n…liz‘ti?n
macroman   I–t‘rn‰ti™nˆliz¾ti¿n
utf-8      IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n
If what you see matches one of the values in the second column, then the value in the first column is the equivalent of “parlez vous anglais?” You can use this later. Unfortunately, “cp437” and “macroman” are not widely “spoken”, so you really should seek alternatives. If you have “utf-8”, you are actually ahead of the game.
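If you are curious where these garbled strings come from, here is a small Python sketch that reproduces them. It assumes the text was written out in one encoding and then read back as windows-1252; the helper name `garble` is made up for this illustration and is not part of any tool.

```python
SAMPLE = "Iñtërnâtiônàlizætiøn"


def garble(text, actual_encoding, assumed_encoding="windows-1252"):
    """Encode text one way, then decode it as if it were something else."""
    # errors="replace" turns characters the encoding cannot represent
    # into "?", which is what happens to "ø" in cp437.
    raw = text.encode(actual_encoding, errors="replace")
    return raw.decode(assumed_encoding, errors="replace")


if __name__ == "__main__":
    # "mac_roman" is Python's codec name for macroman.
    for name in ("cp437", "mac_roman", "utf-8"):
        print(name, garble(SAMPLE, name))
```

Each encoding leaves its own fingerprint on the test string, which is why the garbled form identifies the encoding your tool is really using.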
Cleaning windows
If you can successfully process text with diacritical marks, you are likely using either iso-8859-1 or its close cousin, windows-1252. There is no polite way to put this: if you are running on a Microsoft platform, or cut and paste from documents produced by Microsoft software, or even allow comments to be posted by people who might be doing one of the above, you need to be aware of the 27 differences, summarized by the table below:
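character  win-1252                unicode
           decimal  hex  octal     html       xml       url
€          128      80   200       &euro;     &#8364;   %E2%82%AC
‚          130      82   202       &sbquo;    &#8218;   %E2%80%9A
ƒ          131      83   203       &fnof;     &#402;    %C6%92
„          132      84   204       &bdquo;    &#8222;   %E2%80%9E
…          133      85   205       &hellip;   &#8230;   %E2%80%A6
†          134      86   206       &dagger;   &#8224;   %E2%80%A0
‡          135      87   207       &Dagger;   &#8225;   %E2%80%A1
ˆ          136      88   210       &circ;     &#710;    %CB%86
‰          137      89   211       &permil;   &#8240;   %E2%80%B0
Š          138      8A   212       &Scaron;   &#352;    %C5%A0
‹          139      8B   213       &lsaquo;   &#8249;   %E2%80%B9
Œ          140      8C   214       &OElig;    &#338;    %C5%92
Ž          142      8E   216       &#381;     &#381;    %C5%BD
‘          145      91   221       &lsquo;    &#8216;   %E2%80%98
’          146      92   222       &rsquo;    &#8217;   %E2%80%99
“          147      93   223       &ldquo;    &#8220;   %E2%80%9C
”          148      94   224       &rdquo;    &#8221;   %E2%80%9D
•          149      95   225       &bull;     &#8226;   %E2%80%A2
–          150      96   226       &ndash;    &#8211;   %E2%80%93
—          151      97   227       &mdash;    &#8212;   %E2%80%94
˜          152      98   230       &tilde;    &#732;    %CB%9C
™          153      99   231       &trade;    &#8482;   %E2%84%A2
š          154      9A   232       &scaron;   &#353;    %C5%A1
›          155      9B   233       &rsaquo;   &#8250;   %E2%80%BA
œ          156      9C   234       &oelig;    &#339;    %C5%93
ž          158      9E   236       &#382;     &#382;    %C5%BE
Ÿ          159      9F   237       &Yuml;     &#376;    %C5%B8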
The way to read this table is that the first column shows what the character should look like. The next three columns describe the input character in decimal, hex, and octal. The final three columns tell you what to replace this character with if your target is html, xml, or a url.
Note: as these code points are defined as control characters in iso-8859-1, you are safe to assume that you will not encounter any of these bytes in the normal course of processing user input; and therefore mapping these characters is a safe thing to do.
Alternatively, you can declare defeat, and use windows-1252 as your encoding. However, as this encoding is not universally recognized, you may be shutting out a portion of your intended audience.
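For those who would rather map than declare defeat, the following is a rough Python sketch of the mapping the table describes. It relies on Python's built-in windows-1252 codec to look up each character instead of hard-coding the 27 rows, and it emits numeric character references for the xml case. The function names are invented for this example, and dropping the five unassigned bytes is an assumption, not something the table prescribes. The sketch does not escape characters such as & or <; that is a separate job.

```python
from urllib.parse import quote


def win1252_extra(byte_value):
    """Return the windows-1252 character for a byte in 128-159, or None.

    Five bytes in that range (hex 81, 8D, 8F, 90, 9D) are unassigned in
    windows-1252, which is why there are 27 differences rather than 32.
    """
    try:
        return bytes([byte_value]).decode("windows-1252")
    except UnicodeDecodeError:
        return None


def to_xml_text(raw):
    """Turn bytes that are mostly iso-8859-1 into text, escaping the extras."""
    pieces = []
    for b in raw:
        if 0x80 <= b <= 0x9F:
            ch = win1252_extra(b)
            if ch is not None:
                # Numeric character reference for the Unicode code point.
                pieces.append("&#%d;" % ord(ch))
            # Assumption: unassigned bytes are silently dropped.
        else:
            # Every other iso-8859-1 byte equals its Unicode code point.
            pieces.append(chr(b))
    return "".join(pieces)


def to_url_escape(byte_value):
    """Percent-encode the UTF-8 form of one of the windows-1252 extras."""
    ch = win1252_extra(byte_value)
    return quote(ch) if ch is not None else ""
```

For example, `to_xml_text(b"\x93hi\x94")` yields `&#8220;hi&#8221;`, and `to_url_escape(0x80)` yields `%E2%82%AC`, matching the euro row of the table.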
The next step
If all you are looking to do is the absolute minimum, there is no problem stopping when you successfully get to iso-8859-1. The following is for people who want to make an informed decision on utf-8. First, the steps to convert from iso-8859-1 to utf-8:
- All characters in the range of 0-127 (hex 00 through 7F) are represented identically in both encodings. This covers the entire range of the original ASCII characters.
- All iso-8859-1 characters in the range of 128-191 (hex 80 through BF) need to be preceded by a byte with the value of 194 (hex C2) in utf-8, but otherwise are left intact.
- All iso-8859-1 characters in the range of 192-255 (hex C0 through FF) not only need to be preceded by a byte with the value of 195 (hex C3) in utf-8, but also need to have 64 (hex 40) subtracted from the iso-8859-1 character value. For example, a “ñ” (decimal 241, hex F1) becomes a 195 followed by a 177 (hex C3 B1).
Or, if you prefer, here's some working code.
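The three rules above translate almost directly into code. What follows is a minimal Python sketch of them, written for illustration; it is not the code the link above points to, and the function name `latin1_to_utf8` is made up.

```python
def latin1_to_utf8(data):
    """Convert iso-8859-1 bytes to utf-8 bytes using the three rules."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            # 0-127: identical in both encodings.
            out.append(b)
        elif b < 0xC0:
            # 128-191: prefix with C2, keep the byte as-is.
            out.append(0xC2)
            out.append(b)
        else:
            # 192-255: prefix with C3, subtract hex 40 from the byte.
            out.append(0xC3)
            out.append(b - 0x40)
    return bytes(out)
```

With this, `latin1_to_utf8(b"\xf1")` gives `b"\xc3\xb1"`, the “ñ” example from the list. In real code you would more likely let the built-in codecs do the work, as in `data.decode("iso-8859-1").encode("utf-8")`, which should produce the same bytes.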
The primary advantage of utf-8 is that it can directly represent all Unicode characters (except Klingon). With other encoding systems there are only partial solutions like numeric character references that work in many important contexts (like HTML and XML), but not in others (like FORM inputs).
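To make the numeric character reference workaround concrete, here is a short Python sketch. It uses the standard `xmlcharrefreplace` error handler to squeeze text into plain ASCII; the function name `to_ncr` is invented for this example.

```python
import html


def to_ncr(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


escaped = to_ncr("Iñtërnâtiônàlizætiøn")
# escaped is now pure ASCII:
# I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n

# An HTML or XML consumer turns the references back into characters.
assert html.unescape(escaped) == "Iñtërnâtiônàlizætiøn"
```

The result survives any ASCII-compatible encoding, but only a consumer that understands the references will turn it back into the original text, which is the limitation described above.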
The primary disadvantage of utf-8 is that some less frequently used characters will require additional bytes in order to be stored or transmitted.
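A quick way to see the size cost for your own data is to count bytes. The sketch below is only an illustration; `utf8_length` is a hypothetical helper name.

```python
def utf8_length(text):
    """Number of bytes needed to store text as utf-8."""
    return len(text.encode("utf-8"))


sample = "Iñtërnâtiônàlizætiøn"

# ASCII letters take one byte each, the accented letters here take two,
# and a character such as the euro sign takes three.
sizes = {ch: utf8_length(ch) for ch in ("n", "ñ", "€")}

# 20 characters: 20 bytes as iso-8859-1, 27 bytes as utf-8.
latin1_size = len(sample.encode("iso-8859-1"))
utf8_size = utf8_length(sample)
```

For text that is mostly ASCII the overhead is small; it grows with the share of characters outside that range.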
Further reading
- Introduction/motivation
- Reference
- A tutorial on character code issues by Jukka "Yucca" Korpela.
- The character encoding problem by Peter Verthez
- Introduction to i18n by Tomohiro KUBOTA
- The Properties and Promises of UTF-8 by Martin J. Dürst
- Tool support
- ecto and character encoding by Adriaan Tijsseling
- MTStripControlChars by Jacques Distler
- Mozilla bugs: 18643, 81203, 228779
- Displaying international characters in JSP