Companion to Atom
Work in progress. By Sam Ruby
Iñtërnâtiônàlizætiøn
Before we can talk about HTTP content types, or about URL escape sequences, or about XML encoding, we need to tackle the topic of what exactly a character is.
Now admit it. Despite the fact that you know that there are some excellent introductions to Unicode out there, you really don't want to learn Unicode, do you? What you really want is for someone to tell you the minimum you need to do to keep your data from being garbled in transit or rejected outright upon receipt. So what I am going to provide here is a survival guide of sorts. What’s a survival guide, you ask? Well, before you travel to a foreign country that speaks a language you don't know, it always helps to learn a few key phrases like “parlez vous anglais?” or “una cerveza por favor”.
OK, so we’ve established that you’ve got a tool that you want to ensure is internationalized. The first thing I want you to do is to copy the string
Iñtërnâtiônàlizætiøn
into your tool and observe what comes out the other side. If you have a weblogging tool, put it in the title, summary, content, and any other nook and cranny you can find. Comments too, if you have got them. Check every output that can be produced, including html and feeds.
Here are some common examples of garbled output:
| encoding | garbled output |
|---|---|
| cp437 | I¤t‰rnƒti“n…liz‘ti?n |
| macroman | I–t‘rn‰ti™nˆliz¾ti¿n |
| utf-8 | Iñtërnâtiônà lizætiøn |
If what you see matches one of the values in the second column, then the value in the first column is the equivalent of “parlez vous anglais?” You can use this later. Unfortunately, “cp437” and “macroman” are not widely “spoken”, so you really should seek alternatives. If you have “utf-8”, you are actually ahead of the game.
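To see where strings like these can come from, here is a minimal Python sketch. It assumes the garbling happens when text is written out in one encoding and then displayed as if it were windows-1252, which is consistent with the examples above but is only an assumption about your tool. The `garble` helper and `SAMPLE` name are hypothetical; the codec names are the ones Python's standard library uses.

```python
SAMPLE = "Iñtërnâtiônàlizætiøn"

def garble(text: str, actual_encoding: str) -> str:
    """Encode text in one encoding, then misread the bytes as windows-1252."""
    # Characters the source encoding cannot represent become "?".
    raw = text.encode(actual_encoding, errors="replace")
    # Bytes that windows-1252 leaves unassigned become U+FFFD.
    return raw.decode("cp1252", errors="replace")

# One garbled variant per candidate encoding (assumed display: windows-1252).
results = {name: garble(SAMPLE, name) for name in ("cp437", "mac_roman", "utf-8")}
```

Comparing what your tool emits against the values in `results` is one way to guess which encoding it is actually “speaking”.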
Cleaning windows
If you can successfully process text with diacritical marks, you are likely using either iso-8859-1 or its close cousin, windows-1252. There is no polite way to put this: if you are running on a Microsoft platform, or cut and paste from documents produced by Microsoft software, or even allow comments to be posted by people who might be doing one of the above, you need to be aware of the 27 differences, summarized by the table below:
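| character | win-1252 decimal | win-1252 hex | win-1252 octal | unicode html | unicode xml | unicode url |
|---|---|---|---|---|---|---|
| € | 128 | 80 | 200 | `&euro;` | `&#8364;` | %E2%82%AC |
| ‚ | 130 | 82 | 202 | `&sbquo;` | `&#8218;` | %E2%80%9A |
| ƒ | 131 | 83 | 203 | `&fnof;` | `&#402;` | %C6%92 |
| „ | 132 | 84 | 204 | `&bdquo;` | `&#8222;` | %E2%80%9E |
| … | 133 | 85 | 205 | `&hellip;` | `&#8230;` | %E2%80%A6 |
| † | 134 | 86 | 206 | `&dagger;` | `&#8224;` | %E2%80%A0 |
| ‡ | 135 | 87 | 207 | `&Dagger;` | `&#8225;` | %E2%80%A1 |
| ˆ | 136 | 88 | 210 | `&circ;` | `&#710;` | %CB%86 |
| ‰ | 137 | 89 | 211 | `&permil;` | `&#8240;` | %E2%80%B0 |
| Š | 138 | 8A | 212 | `&Scaron;` | `&#352;` | %C5%A0 |
| ‹ | 139 | 8B | 213 | `&lsaquo;` | `&#8249;` | %E2%80%B9 |
| Œ | 140 | 8C | 214 | `&OElig;` | `&#338;` | %C5%92 |
| Ž | 142 | 8E | 216 | `&#381;` | `&#381;` | %C5%BD |
| ‘ | 145 | 91 | 221 | `&lsquo;` | `&#8216;` | %E2%80%98 |
| ’ | 146 | 92 | 222 | `&rsquo;` | `&#8217;` | %E2%80%99 |
| “ | 147 | 93 | 223 | `&ldquo;` | `&#8220;` | %E2%80%9C |
| ” | 148 | 94 | 224 | `&rdquo;` | `&#8221;` | %E2%80%9D |
| • | 149 | 95 | 225 | `&bull;` | `&#8226;` | %E2%80%A2 |
| – | 150 | 96 | 226 | `&ndash;` | `&#8211;` | %E2%80%93 |
| — | 151 | 97 | 227 | `&mdash;` | `&#8212;` | %E2%80%94 |
| ˜ | 152 | 98 | 230 | `&tilde;` | `&#732;` | %CB%9C |
| ™ | 153 | 99 | 231 | `&trade;` | `&#8482;` | %E2%84%A2 |
| š | 154 | 9A | 232 | `&scaron;` | `&#353;` | %C5%A1 |
| › | 155 | 9B | 233 | `&rsaquo;` | `&#8250;` | %E2%80%BA |
| œ | 156 | 9C | 234 | `&oelig;` | `&#339;` | %C5%93 |
| ž | 158 | 9E | 236 | `&#382;` | `&#382;` | %C5%BE |
| Ÿ | 159 | 9F | 237 | `&Yuml;` | `&#376;` | %C5%B8 |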
The way to read this table is that the first column shows what the character should look like. The next three columns describe the input character in decimal, hex, and octal. The final three columns tell you what to replace this character with if your target is html, xml, or a url.
Note: as these code points are defined as control characters in iso-8859-1, you are safe to assume that you will not encounter any of these bytes in the normal course of processing user input; and therefore mapping these characters is a safe thing to do.
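As a rough illustration of the mapping, the following Python sketch replaces bytes 128 through 159 with the numeric character references from the xml column, and builds the url column for a single byte. The function names are hypothetical, the input is assumed to be otherwise iso-8859-1, and the five byte values that windows-1252 leaves unassigned (129, 141, 143, 144, 157) are simply dropped, which is a choice made here rather than a rule. It does not escape `&` or `<`, which real html or xml output would also need.

```python
from urllib.parse import quote

def win1252_to_xml(data: bytes) -> str:
    """Map windows-1252 bytes 128-159 to numeric character references."""
    out = []
    for byte in data:
        if 0x80 <= byte <= 0x9F:
            try:
                ch = bytes([byte]).decode("cp1252")
            except UnicodeDecodeError:
                continue  # unassigned in windows-1252: dropped (assumption)
            out.append(f"&#{ord(ch)};")
        else:
            # iso-8859-1 byte values equal their Unicode code points.
            out.append(chr(byte))
    return "".join(out)

def win1252_to_url(byte: int) -> str:
    """Percent-encode the utf-8 form of one assigned windows-1252 byte."""
    return quote(bytes([byte]).decode("cp1252"))
```

For example, `win1252_to_xml(b"\x93hi\x94")` wraps `hi` in the references for the left and right curly double quotes.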
Alternatively, you can declare defeat, and use windows-1252 as your encoding. However, as this encoding is not universally recognized, you may be shutting out a portion of your intended audience.
The next step
If all you are looking to do is the absolute minimum, there is no problem stopping once you have successfully reached iso-8859-1. The following is for people who want to make an informed decision on utf-8. First, the steps to convert from iso-8859-1 to utf-8:
- All characters in the range of 0-127 (hex 00 through 7F) are represented identically in both encodings. This covers the entire range of the original ASCII characters.
- All iso-8859-1 characters in the range of 128-191 (hex 80 through BF) need to be preceded by a byte with the value of 194 (hex C2) in utf-8, but otherwise are left intact.
- All iso-8859-1 characters in the range of 192-255 (hex C0 through FF) not only need to be preceded by a byte with the value of 195 (hex C3) in utf-8, but also need to have 64 (hex 40) subtracted from the iso-8859-1 character value. For example, a “ñ” (decimal 241, hex F1) becomes a 195 followed by a 177 (hex C3 B1).
Or, if you prefer, here's some working code.
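The three rules above can be sketched in a few lines of Python. This is an illustrative version written for this guide's rules, not necessarily the code originally linked; `latin1_to_utf8` is a hypothetical name.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert iso-8859-1 bytes to utf-8 using the three rules above."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            out.append(byte)            # 0-127: identical in both encodings
        elif byte < 0xC0:
            out.append(0xC2)            # 128-191: prefix with C2, keep the byte
            out.append(byte)
        else:
            out.append(0xC3)            # 192-255: prefix with C3, subtract 0x40
            out.append(byte - 0x40)
    return bytes(out)
```

In practice Python's built-in codecs do the same job, `data.decode("iso-8859-1").encode("utf-8")`; the hand-written loop is only there to make the byte arithmetic visible.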
The primary advantage of utf-8 is that it can directly represent all Unicode characters (except Klingon). With other encoding systems there are only partial solutions like numeric character references, which work in many important contexts (like HTML and XML), but not in others (like FORM inputs).
The primary disadvantage of utf-8 is that some less frequently used characters will require additional bytes in order to be stored or transmitted.
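To get a feel for the size cost, this small Python sketch counts the bytes utf-8 needs for a few sample characters. The `utf8_length` helper and the choice of samples are illustrative assumptions, not part of any standard API.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes utf-8 uses to store the given text."""
    return len(ch.encode("utf-8"))

# ASCII letter, Latin-1 letter, euro sign, and a character outside the
# Basic Multilingual Plane (U+1D11E, musical symbol G clef).
sizes = {ch: utf8_length(ch) for ch in ("n", "ñ", "€", "\U0001D11E")}
```

Here `sizes` maps the four samples to 1, 2, 3, and 4 bytes respectively, so plain ASCII text costs nothing extra while other characters grow.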
Further reading
- Introduction/motivation
- Reference
- A tutorial on character code issues by Jukka "Yucca" Korpela.
- The character encoding problem by Peter Verthez
- Introduction to i18n by Tomohiro KUBOTA
- The Properties and Promises of UTF-8 by Martin J. Dürst
- Tool support
- ecto and character encoding by Adriaan Tijsseling
- MTStripControlChars by Jacques Distler
- Mozilla bugs: 18643, 81203, 228779
- Displaying international characters in JSP