It’s just data

JSON Interop

Python’s simplejson, in an apparent attempt to avoid Unicode issues, defaults to encoding all non-ASCII characters using JSON’s \uXXXX syntax.  Ironically, this causes problems with, of all languages, JavaScript:

$ js
js> load('json.js')
js> print("\u263A".toJSONString());
":"
js> print(unescape(encodeURIComponent("\u263A".toJSONString())));
"☺"

The second, rather unobvious, combination converts the Unicode string to UTF-8 and produces the correct result.  A workaround on the Python side would be:

$ python
>>> import simplejson
>>> simplejson.dumps(u"\u263A",ensure_ascii=False).encode('utf-8')
'"\xe2\x98\xba"'
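For readers on Python 3, the same switch can be sketched with the standard library's json module. This assumes that module treats ensure_ascii the way simplejson does; the helper name dumps_utf8 is hypothetical and not part of either library.

```python
import json


def dumps_utf8(value):
    """Serialize value as JSON and return UTF-8 bytes, leaving
    non-ASCII characters unescaped (hypothetical helper)."""
    return json.dumps(value, ensure_ascii=False).encode("utf-8")


smiley = "\u263a"  # WHITE SMILING FACE

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(smiley)

# Workaround: emit the character itself, then encode the text as UTF-8.
raw = dumps_utf8(smiley)

print(escaped)  # "\u263a"
print(raw)      # b'"\xe2\x98\xba"'
```

Both forms decode to the same string in a conforming JSON parser; they differ only in the bytes that travel over the wire.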

Update: bug 397215 has been opened on the SpiderMonkey shell, and a compile-time switch is already available to handle UTF-8 correctly. See the comments for details.
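The colon in the first JavaScript result is consistent with an output routine that keeps only the low byte of each character, since U+263A ends in 0x3A, the code for ":". The following Python 3 sketch models that assumption and the unescape(encodeURIComponent(...)) combination; it is not SpiderMonkey's code, and both function names are hypothetical.

```python
from urllib.parse import quote, unquote


def truncate_to_low_byte(text):
    """Model an output routine that keeps only the low 8 bits of
    each character code (an assumption about the shell's print)."""
    return "".join(chr(ord(ch) & 0xFF) for ch in text)


def utf8_as_latin1(text):
    """Model unescape(encodeURIComponent(text)): percent-encode the
    UTF-8 bytes, then turn each %XX into the character with code XX."""
    return unquote(quote(text, safe=""), encoding="latin-1")


smiley_json = '"\u263a"'

# Printing directly: 0x263A & 0xFF == 0x3A, which is ":".
print(truncate_to_low_byte(smiley_json))  # ":"

# After the trick every character code is below 256, so truncation
# no longer loses anything and the bytes written are valid UTF-8.
tricked = utf8_as_latin1(smiley_json)
written = truncate_to_low_byte(tricked).encode("latin-1")
print(written == smiley_json.encode("utf-8"))  # True
```

Under this model the trick works only because the output routine discards the high byte; an output routine that encoded characters properly would encode the already-encoded string a second time.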


It looks like a problem with the js shell (or in json.js, but I doubt it), as both forms work for me at the CouchDB shell (http://localhost:8888/_utils/shell.html):

"\u263a"
☺
print("\u263a")
☺

The same strings print a colon in console js. So it might well be an artifact of the output routines, something that is regrettably common.

Posted by Santiago Gala at

The problem has many facets.

So, all in all, yes I would agree: the problem lies in the js shell.  It was apparently not meant for serious work, at least not work that routinely deals with non-ASCII characters.  That doesn’t mean that it won’t be employed for serious work, even work that involves exposure to characters from the Orient, eastern Europe, or even an odd Latin character or two; just that such work needs to be aware of, and compensate for, these shortcomings.

Posted by Sam Ruby at

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=397215 on the print() bug.

I don’t recommend using the encodeURIComponent + unescape trick.  There are actual characters at many of the code points between 128 and 255, so code that tries to outsmart print() like that is going to result in buggy behavior once bug 397215 is fixed.

Posted by Jesse Ruderman at

The shell is an old testing REPL which really should not be used for serious work without more testing and bug-fixing. Like the kind happening here, for instance ;-).

The SpiderMonkey “js” shell and API go back to pre-Unicode days in 1996. We added Unicode but did not revisit the shell or its dependencies such as readline. With the relatively new compile-time JS_C_STRINGS_ARE_UTF8 option, we could. Patches are welcome as usual; this is not something I’m going to have time for personally, but I agree it should be fixed.

/be

Posted by Brendan Eich at

With JS_C_STRINGS_ARE_UTF8:

$ js utf8_test.js < testdata
text: 2. Russian: На берегу пустынных волн
[50, 46, 32, 82, 117, 115, 115, 105, 97, 110, 58, 32, 1053, 1072, 32, 1073, 1077, 1088, 1077, 1075, 1091, 32, 1087, 1091, 1089, 1090, 1099, 1085, 1085, 1099, 1093, 32, 1074, 1086, 1083, 1085]

Without:

$ js utf8_test.js < testdata
text: 2. Russian: На берегу пустынных волн
[50, 46, 32, 82, 117, 115, 115, 105, 97, 110, 58, 32, 208, 157, 208, 176, 32, 208, 177, 208, 181, 209, 128, 208, 181, 208, 179, 209, 131, 32, 208, 191, 209, 131, 209, 129, 209, 130, 209, 139, 208, 189, 208, 189, 209, 139, 209, 133, 32, 208, 178, 208, 190, 208, 187, 208, 189]

Looks like JS_C_STRINGS_ARE_UTF8 solves the problem mentioned above.
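The two arrays can be reproduced outside the shell. This Python 3 sketch assumes the first array holds the character codes of the decoded text and the second holds its raw UTF-8 bytes, each read as a separate character; utf8_test.js itself is not shown here, so this is an illustration and not a port of that script.

```python
text = "2. Russian: На берегу пустынных волн"

# With UTF-8 decoding: one number per character (the code point).
code_points = [ord(ch) for ch in text]

# Without: one number per UTF-8 byte, so each Cyrillic letter
# shows up as two values (208 or 209 followed by a second byte).
utf8_bytes = list(text.encode("utf-8"))

print(code_points[12:14])  # [1053, 1072]
print(utf8_bytes[12:16])   # [208, 157, 208, 176]
print(len(code_points), len(utf8_bytes))  # 36 57
```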

Posted by Sam Ruby at

Meanwhile, people reading (or echoing) your original post or feed may believe the bugginess is attributable to JavaScript itself (as your claims usually have good merit), and not to one particular broken JavaScript console.

I’m not sure how you usually address updates, but it wouldn’t hurt to name that JavaScript shell as broken on non-ASCII I/O, and off by one UTF-8 conversion, instead. That makeshift UTF-8 encoder in the post unfortunately just looks like incantations of black magic to any reader less than deeply familiar with the (rather fortunate!) circumstances that make it work, adding needless further mysticism to the claim.

(Even I reacted that way to it, and I made the connection only from recognizing my code. :-)

I think there are some issues with your OpenID delegation handling, by the way; on referring the URI http://ecmanaut.blogspot.com/ (actually without noting it to be a field for OpenID identification, but since it should delegate to http://ecmanaut.myopenid.com/ I’d probably be fine anyway, if we all did things right), I eventually landed on a CGI failure of yours, citing this backtrace:

Traceback (most recent call last):
  File "gateway.cgi", line 47, in ?
    identity.validate(dict(cgi.parse_qsl(os.environ['QUERY_STRING'])))
  File "/home/rubys/mombo/identity.py", line 55, in validate
    file = writeComment(session['parent'],title,body,decache=True)
  File "/home/rubys/mombo/post.py", line 240, in writeComment
    raise Exception(message)
Exception: POST limit exceeded


Posted by Johan Sundström at
