JSON Interop
Python’s simplejson, in an apparent attempt to avoid Unicode issues, defaults to encoding all non-ASCII characters using JSON’s \uXXXX syntax. Ironically, this causes problems with, of all languages, JavaScript:
$ js
js> load('json.js')
js> print("\u263A".toJSONString());
":"
js> print(unescape(encodeURIComponent("\u263A".toJSONString())));
"☺"
The second, rather unobvious, combination converts the Unicode string to utf-8 and produces the correct result. A workaround on the Python side would be:
$ python
>>> import simplejson
>>> simplejson.dumps(u"\u263A", ensure_ascii=False).encode('utf-8')
'"\xe2\x98\xba"'
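For readers on a current Python, the same two behaviours can be sketched with the standard library's json module, which descends from simplejson. This assumes Python 3, where every str is already Unicode, so the names below (smiley, escaped, utf8_bytes) are illustrative only and not part of the original session:

```python
import json  # stdlib descendant of simplejson; assumed to behave the same here

smiley = "\u263a"

# Default (ensure_ascii=True): every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(smiley)

# ensure_ascii=False keeps the character itself; encode explicitly to get bytes.
utf8_bytes = json.dumps(smiley, ensure_ascii=False).encode("utf-8")

print(escaped)
print(utf8_bytes)
```

Both forms are valid JSON and decode to the same string; they differ only in whether the consumer has to cope with raw UTF-8 bytes or with escapes.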
Update: bug 397215 has been opened on the SpiderMonkey shell, and a compile-time switch is already available to handle UTF-8 correctly. See the comments for details.
The problem has many facets.
- The print function in the js shell simply outputs the least significant byte of every character. Unfortunately, the CouchDB JS Server depends on this exact function.
- What appears to be the de facto JSON implementation for JavaScript doesn’t actually handle utf-8. A Unicode character that takes two bytes in utf-8 is represented as two Unicode characters. Though arguably that’s not this library’s fault, as it was handed what allegedly was a String, not a sequence of bytes, which leads us to...
- SpiderMonkey’s shell-provided readline function merely maps bytes to characters one to one. So it makes a nice match for the print function also provided by the shell.
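The three facets can be modelled in a few lines of Python. This is only a sketch of the behaviour described here, not SpiderMonkey's actual code, and the helper names (shell_print, shell_readline, utf8_as_chars) are made up for illustration:

```python
def shell_print(text: str) -> bytes:
    """Model of the shell's print: keep only the low byte of each character."""
    return bytes(ord(ch) & 0xFF for ch in text)


def shell_readline(raw: bytes) -> str:
    """Model of the shell's readline: each input byte becomes one character."""
    return raw.decode("latin-1")


def utf8_as_chars(text: str) -> str:
    """Model of unescape(encodeURIComponent(s)): UTF-8 bytes held as characters."""
    return text.encode("utf-8").decode("latin-1")


smiley = "\u263a"

# 0x263A & 0xFF == 0x3A, which is the colon seen in the shell session.
naive = shell_print(smiley)

# Pre-encoding to UTF-8 lets the byte-truncating print emit the right bytes.
worked_around = shell_print(utf8_as_chars(smiley))

# Reading those bytes back yields three characters rather than one.
round_trip = shell_readline(worked_around)

print(naive, worked_around, len(round_trip))
```

In this model print and readline are exact inverses of each other for byte-sized characters, which is why the pair looks consistent until a character above U+00FF shows up.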
So, all in all, yes I would agree: the problem lies in the js shell. It was apparently not meant for serious work, at least not work that routinely deals with non-ASCII characters. That doesn’t mean that it won’t be employed for serious work, even work that involves exposure to characters from the Orient, eastern Europe, or even an odd Latin character or two; just that such work needs to be aware of, and compensate for, these shortcomings.
Posted by Sam Ruby at
I filed https://bugzilla.mozilla.org/show_bug.cgi?id=397215 on the print() bug.
I don’t recommend using the encodeURIComponent + unescape trick. There are actual characters at many of the code points between 128 and 255, so code that tries to outsmart print() like that is going to result in buggy behavior once bug 397215 is fixed.
Posted by Jesse Ruderman at
The shell is an old testing REPL which really should not be used for serious work without more testing and bug-fixing. Like the kind happening here, for instance ;-).
The SpiderMonkey “js” shell and API go back to pre-Unicode days in 1996. We added Unicode but did not revisit the shell or its dependencies such as readline. With the relatively new compile-time JS_C_STRINGS_ARE_UTF8 option, we could. Patches are welcome as usual; this is not something I’m going to have time for personally, but I agree it should be fixed.
/be
Posted by Brendan Eich at
With JS_C_STRINGS_ARE_UTF8:
$ js utf8_test.js < testdata
text: 2. Russian: На берегу пустынных волн
[50, 46, 32, 82, 117, 115, 115, 105, 97, 110, 58, 32, 1053, 1072, 32, 1073, 1077, 1088, 1077, 1075, 1091, 32, 1087, 1091, 1089, 1090, 1099, 1085, 1085, 1099, 1093, 32, 1074, 1086, 1083, 1085]
Without:
$ js utf8_test.js < testdata
text: 2. Russian: На берегу пустынных волн
[50, 46, 32, 82, 117, 115, 115, 105, 97, 110, 58, 32, 208, 157, 208, 176, 32, 208, 177, 208, 181, 209, 128, 208, 181, 208, 179, 209, 131, 32, 208, 191, 209, 131, 209, 129, 209, 130, 209, 139, 208, 189, 208, 189, 209, 139, 209, 133, 32, 208, 178, 208, 190, 208, 187, 208, 189]
Looks like JS_C_STRINGS_ARE_UTF8 solves the problem mentioned above.
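The difference between the two runs can be reproduced outside the shell. The following Python sketch assumes the test data is UTF-8 on disk and models the two builds as two ways of decoding the same bytes; it is not the utf8_test.js script itself, and the variable names are illustrative:

```python
text = "На берегу пустынных волн"
raw = text.encode("utf-8")  # the bytes as they would arrive on stdin

# With JS_C_STRINGS_ARE_UTF8: bytes are decoded as UTF-8, one code per letter.
with_utf8 = [ord(ch) for ch in raw.decode("utf-8")]

# Without it: every byte becomes its own character (a Latin-1 style mapping).
without_utf8 = [ord(ch) for ch in raw.decode("latin-1")]

print(with_utf8)
print(without_utf8)
```

Each Cyrillic letter occupies two bytes in UTF-8, so the second list is nearly twice as long and contains nothing above 255.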
Posted by Sam Ruby at
Meanwhile, people reading (or echoing) your original post / feed, may believe there is merit to your claims about bugginess attributable to javascript (as you usually have good merit to your claims), and not some particular broken javascript console.
I’m not sure how you usually address updates, but it wouldn’t hurt to name that javascript shell as broken on non-ASCII I/O and off by one UTF8 conversion instead. That makeshift UTF8 encoder in the post unfortunately just looks like incantations of black magic to any reader less than deeply familiar with the (rather fortunate!) circumstances that make it work, adding needless further mysticism to the claim.
(Even I reacted that way on it, and I made the connection only from recognizing my code. :-)
I think there are some issues with your OpenID delegation handling, by the way; on referring the URI http://ecmanaut.blogspot.com/ (actually without noting it to be a field for OpenID identification, but since it should delegate to http://ecmanaut.myopenid.com/ I’d probably be fine anyway, if we all did things right), I eventually landed on a CGI failure of yours, citing this backtrace:
traceback:
Traceback (most recent call last):
  File "gateway.cgi", line 47, in ?
    identity.validate(dict(cgi.parse_qsl(os.environ['QUERY_STRING'])))
  File "/home/rubys/mombo/identity.py", line 55, in validate
    file = writeComment(session['parent'],title,body,decache=True)
  File "/home/rubys/mombo/post.py", line 240, in writeComment
    raise Exception(message)
Exception: POST limit exceeded
Posted by Johan Sundström at
It looks like a problem with the js shell (or in json.js, but I doubt it), as both forms work for me at the CouchDB shell (http://localhost:8888/_utils/shell.html):
"\u263a"
☺
print("\u263a")
☺
The same strings print a colon in console js. So it might well be an artifact of the output routines, something that is regrettably common.
Posted by Santiago Gala at