It’s just data

Astral-Plane Characters in JSON

In Characters vs. Bytes, Tim Bray mentions the Gothic letter faihu (U+10348).  Whether such a character will display properly in your browser depends on what operating system you use and what fonts you have installed.  Whether you can handle such characters programmatically, however, depends on what programming language you use.

I’ve taken a closer look at three such languages.


With SpiderMonkey, the story is actually pretty good.  Data is represented internally in JavaScript (and in JSON) as UTF-16/UCS-2, so the faihu character has to be represented as two surrogate characters.  This does mean that .length will produce the “wrong” result, but otherwise the conversions to and from JSON all work, and output is (or will be once this patch works its way through the system) properly converted to and from UTF-8.

faihu_json.js: output
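To make the surrogate arithmetic concrete, here is a minimal sketch in Python 3 (not the SpiderMonkey code from the listing above). The helper name `to_surrogate_pair` is made up for illustration; the only facts it relies on are the UTF-16 encoding rules and faihu's code point, U+10348.

```python
# Sketch only: to_surrogate_pair is a hypothetical helper, not a library API.
FAIHU = "\U00010348"  # GOTHIC LETTER FAIHU, outside the Basic Multilingual Plane


def to_surrogate_pair(code_point):
    """Split an astral code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(ord(FAIHU))

# Number of 16-bit code units in the UTF-16 encoding (no byte order mark).
utf16_units = len(FAIHU.encode("utf-16-be")) // 2

print("U+%04X U+%04X" % (high, low))
print(len(FAIHU), utf16_units)
```

The second number printed is what a UTF-16 based `.length` counts, which is why one visible character reports a length of two.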


Python can be compiled with --enable-unicode=ucs4, and I’m pleased to say that the version of Python that comes with Ubuntu appears to have been built that way.  With this support in place, Python programs can operate directly on Unicode strings that contain astral characters; and with simplejson such programs can produce correct JSON.  Unfortunately, when parsing JSON, characters such as faihu come back as two surrogate characters.  This is easily corrected by converting to either UTF-8 or UTF-16 and back.

output
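The "convert and back" repair can be sketched as follows. This is an illustration under two assumptions: it targets Python 3, and it uses the standard library `json` module (which descends from simplejson) rather than simplejson itself. The Python 3 parser already joins surrogate pairs, so the split string is built by hand here to stand in for what the parser returned at the time of writing. The `surrogatepass` error handler is needed because Python 3 otherwise refuses to encode lone surrogates.

```python
import json

FAIHU = "\U00010348"  # GOTHIC LETTER FAIHU

# Generating: with the default ensure_ascii=True, the astral character is
# written as a pair of \u escapes, which is what JSON requires.
encoded = json.dumps(FAIHU)
print(encoded)  # "\ud800\udf48"

# Parsing: Python 3's json module rejoins the pair into one code point.
decoded = json.loads(encoded)

# Simulated problem case (assumption: this mimics the older parser's result):
# two separate surrogate code points instead of one astral character.
split = "\ud800\udf48"

# The repair: encode to UTF-16 and decode again, letting the codec
# combine the high and low surrogates into a single code point.
rejoined = split.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(split), len(rejoined))  # 2 1
```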


As of Ruby 1.8.6, Ruby has no intrinsic support for Unicode, but by setting $KCODE to UTF8 most libraries will behave reasonably for BMP characters.  Unfortunately, the json gem (in its “pure” form, as neither libjson-ruby nor gem install json will install the C extension variant on Ubuntu) does not handle astral characters correctly.  By “not handling astral characters correctly” I mean that unparse produces syntactically correct but nonsensical output, and that parse raises a runtime exception when asked to read the very same JSON that unparse produced.  The code below demonstrates both the problem and a monkey-patch that compensates for this behavior.

faihu_json.rb: output
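The failure described above is a broken round trip: a generator and parser from the same library should at least agree with each other. As a rough sketch of that property (in Python 3 with the standard `json` module, not the Ruby gem or its monkey-patch), a check might look like this. The helper name `round_trips` is hypothetical.

```python
import json

FAIHU = "\U00010348"  # GOTHIC LETTER FAIHU


def round_trips(value, **dump_options):
    """Return True if parsing the generated JSON gives back the same value."""
    text = json.dumps(value, **dump_options)
    return json.loads(text) == value


# Astral character written as \u escapes (surrogate pair).
escaped_ok = round_trips([FAIHU])

# Astral character written literally in the JSON text.
literal_ok = round_trips([FAIHU], ensure_ascii=False)

print(escaped_ok, literal_ok)
```

Running the same kind of check against any JSON library, with a string containing a character above U+FFFF, is a quick way to see whether it suffers from the problem described here.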