It’s just data

The Pile of Poo Test™

Mathias Bynens: Whenever you’re working on a piece of JavaScript code that deals with strings or regular expressions in some way, just add a unit test that contains a pile of poo (💩) in a string, and see if anything breaks. It’s a quick, fun, and easy way to see if your code supports astral symbols. Once you’ve found a Unicode-related bug in your code, all you need to do is apply the techniques discussed in this post to fix it.
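As a rough illustration of the kind of test described above, here is a minimal sketch. It assumes an ES2015+ engine, and the helper names naiveReverse and codePointReverse are made up for this example; they are not from the article.

```javascript
// Hypothetical helpers, named only for this illustration.
function naiveReverse(str) {
  // split('') cuts on UTF-16 code units, so a surrogate pair gets torn apart.
  return str.split('').reverse().join('');
}

function codePointReverse(str) {
  // Array.from iterates by code point, so a surrogate pair stays together.
  return Array.from(str).reverse().join('');
}

const pooInput = 'a💩b';
console.log(naiveReverse(pooInput) === 'b💩a');     // false
console.log(codePointReverse(pooInput) === 'b💩a'); // true
```

The naive version passes for plain ASCII input, which is why a test string containing 💩 is useful: it fails immediately.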

Interesting.

You’ve been dabbling in various Ruby→Javascript translators (Opal et al.) recently. How good are they at handling 💩?

Posted by Jacques Distler at

Mostly, I’ve been developing Ruby2JS.  It expresses strings as UTF-8 (i.e., it doesn’t attempt to use \u notation).  So:

ruby -r ruby2js -e 'puts Ruby2JS.convert("console.log \"\\u{1F4A9}\"")'

Results in:

console.log("💩")

This seems correct, but piping that into node.js results in:

Posted by Sam Ruby at

So, the short answer is, “Not well.”

This seems correct, ...

Except that (as explained in the article), Javascript (in particular, node.js) doesn’t support UTF-8 strings. U+1F4A9 (💩) is represented as 0xF09F92A9 in UTF-8, but as (the surrogate pair) 0xD83DDCA9 in Javascript.
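The two byte sequences quoted above can be checked with a short sketch. It assumes an environment where TextEncoder is available as a global (current browsers and recent Node releases); the variable names are illustrative.

```javascript
const poo = '\u{1F4A9}'; // the same character as '💩'

// UTF-8 encoding: four bytes.
const utf8Bytes = Array.from(new TextEncoder().encode(poo),
  (b) => b.toString(16).toUpperCase());

// JavaScript string contents: two UTF-16 code units (a surrogate pair).
const utf16Units = poo.split('')
  .map((u) => u.charCodeAt(0).toString(16).toUpperCase());

console.log(utf8Bytes.join(' '));  // F0 9F 92 A9
console.log(utf16Units.join(' ')); // D83D DCA9
```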

Looks like you need to implement a filter, in Ruby2JS, to convert all Ruby strings to whatever you call that bastardized representation.

Posted by Jacques Distler at

I don’t see any mention of utf-8 in the article.  In fact, the article seems to focus exclusively on problems using escape sequences.

In any case, both Firefox and Chrome appear to support utf-8 encoded Astral Characters.  Node, on the other hand, not only doesn’t seem to handle utf-8, it also doesn’t appear to support the proposed workaround:

> console.log('\uD83D\uDCA9')
������

If somebody can identify an environment which supports astral characters encoded as surrogate pairs but does not support astral characters expressed as utf-8, I’ll implement a conversion:

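# Replace each character above U+FFFF with its \uXXXX\uXXXX surrogate-pair escape.
value.inspect.gsub(/[^\u0000-\uFFFF]/) do |c|
  h = ((c.ord - 0x10000) / 0x400) + 0xD800  # high (lead) surrogate
  l = (c.ord - 0x10000) % 0x400 + 0xDC00    # low (trail) surrogate
  "\\u#{h.to_s(16).upcase.rjust(4,'0')}\\u#{l.to_s(16).upcase.rjust(4,'0')}"
end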
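
For comparison, the same arithmetic can be sketched on the JavaScript side. This is not part of Ruby2JS; escapeAstral is a hypothetical name, and the sketch assumes an engine that supports the u regex flag and padStart.

```javascript
// Hypothetical helper: rewrite each astral character as a \uXXXX\uXXXX escape.
function escapeAstral(str) {
  // With the u flag, the character class matches whole code points above U+FFFF.
  return str.replace(/[\u{10000}-\u{10FFFF}]/gu, (ch) => {
    const offset = ch.codePointAt(0) - 0x10000;
    const high = Math.floor(offset / 0x400) + 0xD800; // high (lead) surrogate
    const low = (offset % 0x400) + 0xDC00;            // low (trail) surrogate
    const hex = (n) => n.toString(16).toUpperCase().padStart(4, '0');
    return `\\u${hex(high)}\\u${hex(low)}`;
  });
}

console.log(escapeAstral('console.log("💩")')); // console.log("\uD83D\uDCA9")
```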
Posted by Sam Ruby at

Hmmm. Sorry, you’re right. Browsers seem to agree that

alert('length(💩)='+'💩'.length);

should produce

length(💩)=2

which is half-right. The article’s discussion is all about the internal representation of Unicode strings, not the ability to read/write utf-8 (which is, I guess, a priori independent).
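
A small sketch of that independence, assuming TextDecoder is available as a global (current browsers and recent Node releases): the UTF-8 bytes decode without trouble, yet the resulting string still reports two code units.

```javascript
// The four bytes as they would arrive from a UTF-8 encoded file or socket.
const pooBytes = new Uint8Array([0xF0, 0x9F, 0x92, 0xA9]);
const decoded = new TextDecoder('utf-8').decode(pooBytes);

console.log(decoded === '\uD83D\uDCA9');          // true
console.log(decoded.length);                      // 2 (UTF-16 code units)
console.log(Array.from(decoded).length);          // 1 (code points)
console.log(decoded.codePointAt(0).toString(16)); // 1f4a9
```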

Posted by Jacques Distler at
