It’s just data

Astral-Plane Characters in Json

In Characters vs. Bytes, Tim Bray mentions a Gothic letter faihu.  Whether such a character will display properly in your browser depends on what operating system you use and what fonts you have installed.  Whether or not you can handle such characters programmatically, however, depends on what programming language you use.

I’ve taken a closer look at three such languages.

JavaScript

With SpiderMonkey, the story is actually pretty good.  Data is stored internally in JavaScript (and in JSON) as UTF-16/UCS-2, so the faihu character has to be represented as two surrogate characters.  It does mean that .length will produce the “wrong” results, but otherwise the conversions from and to JSON all work, and output is (or will be once this patch works its way through the system) properly converted to and from utf-8.
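The surrogate arithmetic is small enough to sketch.  What follows is an illustration in Python rather than JavaScript, and the helper name to_surrogate_pair is made up for this example.  It splits a code point above U+FFFF into the two 16-bit units that UTF-16 uses, which is why .length counts a single faihu as 2.

```python
def to_surrogate_pair(code_point):
    """Split an astral-plane code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral-plane code point")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


FAIHU = 0x10346  # GOTHIC LETTER FAIHU
high, low = to_surrogate_pair(FAIHU)
print("\\u%04x\\u%04x" % (high, low))   # prints \ud800\udf46
```

The printed escape is the same pair that an ASCII-only JSON encoder emits for this character.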

faihu_json.js: output

Python

Python can be compiled with --enable-unicode=ucs4, and I’m pleased to say that the version of Python that comes with Ubuntu appears to have done so.  With this support in place, Python programs can operate directly on Unicode strings that contain astral characters; and with simplejson such programs can produce correct JSON.  Unfortunately, when parsing JSON, characters such as faihu are represented as two surrogate characters.  This can easily be corrected by simply converting to either utf-8 or utf-16 and back.
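A minimal sketch of the round trip, assuming a current Python 3 and its standard json module rather than the Python 2 and simplejson described above.  On Python 3 the parser already joins the escaped pair, so the split form is built by hand here to show the re-encoding fix.  The surrogatepass error handler and the helper name join_via_utf16 are choices made for this example, not part of the original script.

```python
import json

FAIHU = "\U00010346"  # GOTHIC LETTER FAIHU as a single code point

# With the default ensure_ascii=True the encoder escapes the character
# as a UTF-16 surrogate pair.
encoded = json.dumps(FAIHU)
print(encoded)                      # prints "\ud800\udf46"

# Python 3's parser joins the pair back into one character.
decoded = json.loads(encoded)
assert decoded == FAIHU and len(decoded) == 1


def join_via_utf16(text):
    """Re-join surrogate pairs by encoding to UTF-16 and decoding again."""
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


split = "\ud800\udf46"              # what an unjoined parse would hand back
assert len(split) == 2
assert join_via_utf16(split) == FAIHU
```

The round trip only works for well-formed pairs; a lone surrogate makes the decode step raise UnicodeDecodeError.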

faihu_json.py: output

Ruby

As of Ruby 1.8.6, Ruby has no intrinsic support for Unicode, but by setting $KCODE to UTF8 most libraries will behave reasonably for BMP characters.  Unfortunately, the json gem (in its “pure” form, as neither libjson-ruby nor gem install json will install the C extension variant on Ubuntu) does not handle astral characters correctly.  By “not handling astral characters correctly” I mean that unparse produces syntactically correct but nonsensical output, and parse raises a runtime exception when given the very same JSON that unparse produced.  The code below demonstrates both the problem and a monkey-patch that compensates for this behavior.

faihu_json.rb: output


Sam Ruby: Astral-Plane Characters in Json


Excerpt from programming: what's new online at

I would have thought that unicodedata.normalize('NFC', string) would handle the surrogate issue.  But it doesn’t on my system.  I’m not quite sure why.  Maybe unicodedata doesn’t know about this particular character...?

Posted by Ian Bicking at

The issue of astral plane characters and surrogates is orthogonal to the Normalization Form.  In particular, it is not a matter of this specific character, but a matter how to handle all of the Unicode characters that are not in the Basic Multilingual Plane.

Posted by Sam Ruby at
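To illustrate the reply above, here is a small check, assuming Python 3 and the standard unicodedata module (the thread itself concerns Python 2).  NFC composes combining sequences, but it leaves a split surrogate pair exactly as it found it, because joining surrogates is a matter of encoding form and not of normalization.

```python
import unicodedata

FAIHU = "\U00010346"     # one code point outside the BMP
SPLIT = "\ud800\udf46"   # the same character left as two surrogates

# NFC composes combining sequences: e + COMBINING ACUTE ACCENT -> U+00E9.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

# The split form comes back unchanged, still two code units long.
assert unicodedata.normalize("NFC", SPLIT) == SPLIT
assert len(unicodedata.normalize("NFC", SPLIT)) == 2

# The properly joined character is already in NFC, so it is untouched as well.
assert unicodedata.normalize("NFC", FAIHU) == FAIHU
```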

I’m being a bit picky, but I think you must mean UTF-16 with respect to Spidermonkey, not UCS-2. The latter only covers the Basic Multilingual Plane, and excludes what Unicode calls the surrogate code points.

Posted by Carey at

My impression is that SpiderMonkey stores character data as (largely uninterpreted) 16-bit units.  Some code (like .length) interprets such data as UCS-2, other code interprets such data as UTF-16.

I guess what I am saying is that in order to deal with astral characters in JavaScript, you need to pretend that JavaScript strings are UTF-16 and cross your fingers.

Posted by Sam Ruby at

The faihu character has to be represented as two surrogate characters.

Technically there is no such thing as a surrogate character. I know what you mean, of course. Just a pedantic FYI.

Posted by James Holderness at

Let’s not forget Old Faithful:

#!/usr/bin/perl
use strict;
use warnings;

use JSON::XS;
use Data::Dumper;
use Encode qw( encode decode );

use Test::More tests => 3;

my ( $faihu, $faihu_json, $roundtrip, $js ) = ( "\x{10346}" );

$js = JSON::XS->new->allow_nonref->ascii;
diag( Dumper $faihu_json = $js->encode( $faihu ) );
diag( Dumper $roundtrip = $js->decode( $faihu_json ) );
is( $roundtrip, $faihu, 'JSON in ASCII roundtrips correctly' );

$js = JSON::XS->new->allow_nonref->utf8;
diag( Dumper $faihu_json = $js->encode( $faihu ) );
diag( Dumper $roundtrip = $js->decode( $faihu_json ) );
is( $roundtrip, $faihu, 'JSON in UTF-8 roundtrips correctly' );

$js = JSON::XS->new->allow_nonref;
diag( Dumper $faihu_json = encode 'UTF-16BE', $js->encode( $faihu ) );
diag( Dumper $roundtrip = $js->decode( decode 'UTF-16BE', $faihu_json ) );
is( $roundtrip, $faihu, 'JSON with external recoding roundtrips correctly' );

Output:

1..3
# $VAR1 = '"\\ud800\\udf46"';
# $VAR1 = "\x{10346}";
ok 1 - JSON in ASCII roundtrips correctly
# $VAR1 = '"𐍆"';
# $VAR1 = "\x{10346}";
ok 2 - JSON in UTF-8 roundtrips correctly
# $VAR1 = '"ØßF"';
# $VAR1 = "\x{10346}";
ok 3 - JSON with external recoding roundtrips correctly


Posted by Aristotle Pagaltzis at

LShift on Erlang: Astral Plane characters in Erlang JSON/RFC4627 implementation

Sam Ruby examines support for astral-plane characters in various JSON implementations. His post prompted me to check my Erlang implementation of rfc4627. I found that for astral plane characters in utf-8, utf-16, or utf-32, everything worked...

Excerpt from Planet Trapexit - Erlang/OTP News at
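The same three-encoding check from the excerpt above can be sketched in Python.  This assumes Python 3.6 or later, where json.loads accepts bytes and detects UTF-8, UTF-16 and UTF-32 on its own; it says nothing about how the Erlang implementation works.

```python
import json

FAIHU = "\U00010346"

# Emit the character literally instead of as a \uXXXX escape pair.
text = json.dumps(FAIHU, ensure_ascii=False)

sizes = {}
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    raw = text.encode(encoding)
    sizes[encoding] = len(raw)
    # json.loads picks the encoding from the pattern of zero bytes.
    assert json.loads(raw) == FAIHU

print(sizes)   # prints {'utf-8': 6, 'utf-16-be': 8, 'utf-32-be': 12}
```

The byte counts cover the two quotation marks plus the faihu, which takes four bytes in each of the three encodings.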

I’d have thought that Python in UCS2 mode (which is actually UTF16, handling surrogates) would have supported such a character to the same degree as Spidermonkey (represent it as a pair of surrogates, but return a length of two).  Was that not the case?

Posted by James Henstridge at

SpiderMonkey follows ECMA-262 Edition 3, which clearly specifies storing, indexing, and counting characters in strings using 16-bit storage units. The first note in Chapter 6 “Source Text” is:

NOTE Although this document sometimes refers to a “transformation” between a “character” within a “string” and the 16-bit unsigned integer that is the UTF-16 encoding of that character, there is actually no transformation because a “character” within a “string” is actually represented using that 16-bit unsigned value.

For JS2/ES4, we hoped to do better, without requiring all implementations to change how they index and reckon lengths. See the update unicode proposal and its discussion page.

But this leaves open issues, such as bugs.ecmascript.org #37.

And it’s a bold move that might bite back with interop bugs, if (say) China favors an ES4 implementation that embraces UTF16 fully, web content is developed accordingly, and then the world-wide-ness of the web puts that content through ES3-style implementations. But we don’t want to impose non-constant-time indexing and length computing (obvious optimizations notwithstanding) on all ES4 implementations, and we do not want to close the door on full UTF16. So, an “allow both” compromise, or invitation to experiment.

/be

Posted by Brendan Eich at

SpiderMonkey follows ECMA-262 Edition 3, which clearly specifies storing, indexing, and counting characters in strings using 16-bit storage units.

Fair enough.  I changed it to put “wrong” in scare quotes.

Posted by Sam Ruby at

I think the result for Python has to be considered a bug in simplejson. In a UCS4 build of Python, loads should recombine surrogate pairs into single characters.

Current behavior is right for a UCS2 build, as astral-plane characters can only be represented via surrogate pairs in those systems, but in UCS4 builds simplejson.loads should return proper characters instead of improper surrogate pairs, I guess. Opinions of expert unicoders welcome.

I have testcases and a patch that does this for UCS4 builds only (distinguished by sys.maxunicode > 66000), and I’m looking for a simplejson bug tracker where I can drop it for further improvement. Any clue?

Not a biggie until the moment philologists start using simplejson-based stuff.

Also:

James Henstridge on UCS2 builds of python

Was that not the case?

It is, at least for Jython (the only UCS2-alike build I have at hand). For those builds my patch keeps simplejson’s current behavior.

Posted by Santiago Gala at
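The approach described in the comment above might look roughly like the following.  This is a guess at the shape of such a fix, not the actual patch: the names join_surrogate_pairs and _join are invented here, and the code assumes Python 3, where sys.maxunicode is always 0x10FFFF, so the narrow-build branch is kept only to mirror the comment.

```python
import re
import sys

# A high surrogate immediately followed by a low surrogate.
_PAIR = re.compile("([\ud800-\udbff])([\udc00-\udfff])")


def _join(match):
    """Turn one matched surrogate pair into the code point it encodes."""
    high = ord(match.group(1)) - 0xD800
    low = ord(match.group(2)) - 0xDC00
    return chr(0x10000 + (high << 10) + low)


def join_surrogate_pairs(text):
    """Replace well-formed surrogate pairs with single characters."""
    if sys.maxunicode <= 0xFFFF:
        # Narrow (UCS2-style) build: pairs are the only representation.
        return text
    return _PAIR.sub(_join, text)


assert join_surrogate_pairs("a\ud800\udf46b") == "a\U00010346b"
```

Unpaired or reversed surrogates do not match the pattern, so they pass through untouched instead of raising an error.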
