Companion to Atom

Work in progress.  By Sam Ruby

Iñtërnâtiônàlizætiøn

Before we can talk about HTTP content types, or about URL escape sequences, or about XML encoding, we need to tackle the topic of what exactly a character is.

Now admit it.  Despite the fact that you know that there are some excellent introductions to Unicode out there, you really don't want to learn Unicode, do you?  What you really want is for someone to tell you the minimum you need to do to keep your data from being garbled in transit or rejected outright upon receipt.  So what I am going to provide here is a survival guide of sorts.  What’s a survival guide, you ask?  Well, before you travel to a foreign country that speaks a language you don't know, it always helps to learn a few key phrases like “parlez vous anglais?” or “una cerveza por favor”.

OK, so we’ve established that you’ve got a tool that you want to ensure is internationalized.  The first thing I want you to do is to copy the string

Iñtërnâtiônàlizætiøn

into your tool and observe what comes out the other side.  If you have a weblogging tool, put it in the title, summary, content, and any other nook and cranny you can find.  Comments too, if you have got them.  Check every output that can be produced, including html and feeds.

Here are some common examples of garbled output:

cp437 I¤t‰rnƒti“n…liz‘ti?n
macroman I–t‘rn‰ti™nˆliz¾ti¿n
utf-8 IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n

If what you see matches one of the values in the second column, then the value in the first column is the equivalent of “parlez vous anglais?”  You can use this later.  Unfortunately, “cp437” and “macroman” are not widely “spoken”, so you really should seek alternatives.  If you have “utf-8”, you are actually ahead of the game.
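If you are curious how each of these garbled forms comes about, here is a minimal Python sketch.  It assumes the garbling happens when text is written in one encoding and then displayed as if it were windows-1252; the function name `garble` is made up for illustration.

```python
# Encode the test string in one encoding, then (wrongly) decode the
# resulting bytes as windows-1252, as a confused tool might do.
TEST = "Iñtërnâtiônàlizætiøn"

def garble(encoding: str) -> str:
    # errors="replace" substitutes "?" for characters the encoding
    # cannot represent (cp437 has no ø).
    raw = TEST.encode(encoding, errors="replace")
    return raw.decode("cp1252")

for name in ("cp437", "mac_roman", "utf-8"):
    print(name, garble(name))
```

Running the same function with other encoding names is a quick way to recognize garbling that is not listed in the table.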

Cleaning windows

If you can successfully process text with diacritical marks, you are likely using either iso-8859-1 or its close cousin, windows-1252.  There is no polite way to put this: if you are running on a Microsoft platform, or cut and paste from documents produced by Microsoft software, or even allow comments to be posted by people who might be doing one of the above, you need to be aware of the 27 differences, summarized in the table below:

character  win-1252             unicode
           decimal  hex  octal  html      xml      url
€          128      80   200    &euro;    &#8364;  %E2%82%AC
‚          130      82   202    &sbquo;   &#8218;  %E2%80%9A
ƒ          131      83   203    &fnof;    &#402;   %C6%92
„          132      84   204    &bdquo;   &#8222;  %E2%80%9E
…          133      85   205    &hellip;  &#8230;  %E2%80%A6
†          134      86   206    &dagger;  &#8224;  %E2%80%A0
‡          135      87   207    &Dagger;  &#8225;  %E2%80%A1
ˆ          136      88   210    &circ;    &#710;   %CB%86
‰          137      89   211    &permil;  &#8240;  %E2%80%B0
Š          138      8A   212    &Scaron;  &#352;   %C5%A0
‹          139      8B   213    &lsaquo;  &#8249;  %E2%80%B9
Œ          140      8C   214    &OElig;   &#338;   %C5%92
Ž          142      8E   216    &#381;    &#381;   %C5%BD
‘          145      91   221    &lsquo;   &#8216;  %E2%80%98
’          146      92   222    &rsquo;   &#8217;  %E2%80%99
“          147      93   223    &ldquo;   &#8220;  %E2%80%9C
”          148      94   224    &rdquo;   &#8221;  %E2%80%9D
•          149      95   225    &bull;    &#8226;  %E2%80%A2
–          150      96   226    &ndash;   &#8211;  %E2%80%93
—          151      97   227    &mdash;   &#8212;  %E2%80%94
˜          152      98   230    &tilde;   &#732;   %CB%9C
™          153      99   231    &trade;   &#8482;  %E2%84%A2
š          154      9A   232    &scaron;  &#353;   %C5%A1
›          155      9B   233    &rsaquo;  &#8250;  %E2%80%BA
œ          156      9C   234    &oelig;   &#339;   %C5%93
ž          158      9E   236    &#382;    &#382;   %C5%BE
Ÿ          159      9F   237    &Yuml;    &#376;   %C5%B8

To read this table: the first column shows what the character should look like.  The next three columns give the input character's windows-1252 value in decimal, hex, and octal.  The final three columns tell you what to replace this character with if your target is html, xml, or a url.

Note: as these code points are defined as control characters in iso-8859-1, you are safe to assume that you will not encounter any of these bytes in the normal course of processing user input; and therefore mapping these characters is a safe thing to do.
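For those who would rather not type the table in by hand, here is a rough Python sketch of the mapping.  It leans on Python's built-in cp1252 codec to supply the windows-1252 assignments; the names `build_win1252_map`, `to_xml`, and `to_url` are my own, chosen for illustration, and the input is assumed to have already been decoded as iso-8859-1.

```python
from urllib.parse import quote

def build_win1252_map() -> dict:
    """Map each assigned windows-1252 byte in 0x80-0x9F to its Unicode character."""
    mapping = {}
    for byte in range(0x80, 0xA0):
        try:
            mapping[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned
    return mapping

WIN1252 = build_win1252_map()

def to_xml(text: str) -> str:
    """Replace stray windows-1252 characters in iso-8859-1 decoded text
    with numeric character references."""
    return "".join(
        f"&#{ord(WIN1252[ord(ch)])};" if ord(ch) in WIN1252 else ch
        for ch in text
    )

def to_url(byte: int) -> str:
    """Percent-encode the UTF-8 form of one windows-1252 byte."""
    return quote(WIN1252[byte], safe="")

raw = b"\x93smart quotes\x94 cost 5\x80"
print(to_xml(raw.decode("iso-8859-1")))  # &#8220;smart quotes&#8221; cost 5&#8364;
print(to_url(0x80))                      # %E2%82%AC
```

Numeric references are used here because they are valid in both html and xml; the named html forms are only a readability nicety.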

Alternatively, you can declare defeat, and use windows-1252 as your encoding.  However, as this encoding is not universally recognized, you may be shutting out a portion of your intended audience.

The next step

If all you are looking to do is the absolute minimum, there is no problem stopping once you have successfully reached iso-8859-1.  The following is for people who want to make an informed decision on utf-8.  First, the steps to convert from iso-8859-1 to utf-8:

Or, if you prefer, here's some working code.
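A minimal Python sketch of such a conversion (not necessarily the code referred to above) might look like the following.  It relies on the fact that iso-8859-1 byte values equal Unicode code points: bytes below 128 pass through unchanged, and each byte from 128 to 255 becomes two bytes.  The function name `latin1_to_utf8` is hypothetical.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert iso-8859-1 bytes to utf-8 bytes, one byte at a time."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            # ASCII range: identical in both encodings.
            out.append(b)
        else:
            # 0x80-0xFF: lead byte carries the top two bits,
            # continuation byte carries the low six bits.
            out.append(0xC0 | (b >> 6))
            out.append(0x80 | (b & 0x3F))
    return bytes(out)

latin1 = "Iñtërnâtiônàlizætiøn".encode("iso-8859-1")
print(latin1_to_utf8(latin1))
```

In everyday Python the same result comes from `data.decode("iso-8859-1").encode("utf-8")`; the loop is spelled out only to show what happens to each byte.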

The primary advantage of utf-8 is that it can directly represent all Unicode characters (except Klingon).  With other encoding systems there are only partial solutions, like numeric character references, that work in many important contexts (like HTML and XML) but not in others (like FORM inputs).
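To make the "partial solution" concrete, here is a small Python sketch of numeric character references in action.  It uses the standard `xmlcharrefreplace` error handler and `html.unescape`; the variable names are just for illustration.

```python
import html

text = "Iñtërnâtiônàlizætiøn"

# Squeeze the string into plain ASCII by turning every other
# character into a numeric character reference.
ascii_only = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_only)

# An HTML or XML aware consumer can undo the references...
print(html.unescape(ascii_only) == text)

# ...but anything that treats the data as plain text just sees
# ampersands, digits, and semicolons.
print(len(text), len(ascii_only))
```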

The primary disadvantage of utf-8 is that some less frequently used characters will require additional bytes in order to be stored or transmitted.
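A quick way to get a feel for the cost is to count bytes.  The Python sketch below is only an illustration; the sample characters and the helper name `utf8_len` are my own choices.

```python
def utf8_len(text: str) -> int:
    """Number of bytes needed to store text as utf-8."""
    return len(text.encode("utf-8"))

# One character each from the 1, 2, 3, and 4 byte ranges:
# ASCII letter, n with tilde, euro sign, musical G clef.
samples = ["n", "\u00f1", "\u20ac", "\U0001D11E"]
for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s)")

# The test string costs one extra byte per accented character.
test = "Iñtërnâtiônàlizætiøn"
print(len(test.encode("iso-8859-1")), utf8_len(test))
```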
