It’s just data

The Pile of Poo Test™

Mathias Bynens: Whenever you’re working on a piece of JavaScript code that deals with strings or regular expressions in some way, just add a unit test that contains a pile of poo (💩) in a string, and see if anything breaks. It’s a quick, fun, and easy way to see if your code supports astral symbols. Once you’ve found a Unicode-related bug in your code, all you need to do is apply the techniques discussed in this post to fix it.


You’ve been dabbling in various Ruby→JavaScript translators (Opal et al.) recently. How good are they at handling 💩?

Posted by Jacques Distler at

Mostly, I’ve been developing Ruby2JS.  It expresses strings as UTF-8 (i.e., it doesn’t attempt to use \u notation).  So:

ruby -r ruby2js -e 'puts Ruby2JS.convert("console.log \"\\u{1F4A9}\"")'

Results in:


This seems correct, but piping that into node.js results in:

Posted by Sam Ruby at

So, the short answer is, “Not well.”

This seems correct, ...

Except that (as explained in the article), JavaScript (in particular, Node.js) doesn’t support UTF-8 strings. U+1F4A9 (💩) is represented as the four bytes 0xF09F92A9 in UTF-8, but as the surrogate pair 0xD83D 0xDCA9 in JavaScript.
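The two representations can be inspected directly. This is a small sketch, assuming a Node.js recent enough to support ES2015 code point escapes; the variable names are illustrative only.

```javascript
// U+1F4A9 written with a code point escape (ES2015 syntax).
const symbol = '\u{1F4A9}';

// UTF-8 encoding: four bytes.
const utf8Hex = Buffer.from(symbol, 'utf8').toString('hex');

// JavaScript string contents: two UTF-16 code units (a surrogate pair).
const codeUnits = [];
for (let i = 0; i < symbol.length; i++) {
  codeUnits.push(symbol.charCodeAt(i).toString(16));
}

console.log(utf8Hex);                            // f09f92a9
console.log(codeUnits);                          // [ 'd83d', 'dca9' ]
console.log(symbol.codePointAt(0).toString(16)); // 1f4a9
```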

Looks like you need to implement a filter, in Ruby2JS, to convert all Ruby strings to whatever you call that bastardized representation.

Posted by Jacques Distler at

I don’t see any mention of UTF-8 in the article.  In fact, the article seems to focus exclusively on problems using escape sequences.

In any case, both Firefox and Chrome appear to support UTF-8 encoded astral characters.  Node, on the other hand, not only doesn’t seem to handle UTF-8, it also doesn’t appear to support the proposed workaround:

> console.log('\uD83D\uDCA9')

If somebody can identify an environment which supports astral characters encoded as surrogate pairs but does not support astral characters expressed as UTF-8, I’ll implement a conversion:

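value.inspect.gsub(/[^\u0000-\uFFFF]/) do |c|
  h = ((c.ord - 0x10000) / 0x400) + 0xD800
  l = (c.ord - 0x10000) % 0x400 + 0xDC00
  # Assumed ending (the snippet is cut off here): emit the pair as \u escapes
  "\\u#{h.to_s(16)}\\u#{l.to_s(16)}"
end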
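For comparison, the same surrogate-pair arithmetic can be sketched in JavaScript. This is an illustration only, not Ruby2JS code: `escapeAstral` is a hypothetical name, and the uppercase `\uXXXX` output format is an assumption.

```javascript
// Hypothetical helper: replace each astral symbol with a \uXXXX\uXXXX escape pair.
function escapeAstral(s) {
  // The u flag makes the pattern match whole code points rather than code units.
  return s.replace(/[\u{10000}-\u{10FFFF}]/gu, (c) => {
    const offset = c.codePointAt(0) - 0x10000;
    const high = Math.floor(offset / 0x400) + 0xD800; // lead surrogate
    const low = (offset % 0x400) + 0xDC00;            // trail surrogate
    return '\\u' + high.toString(16).toUpperCase() +
           '\\u' + low.toString(16).toUpperCase();
  });
}

console.log(escapeAstral('console.log("💩")')); // console.log("\uD83D\uDCA9")
```

Characters inside the Basic Multilingual Plane pass through unchanged, so the function is safe to apply to every string literal.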
Posted by Sam Ruby at

Hmmm. Sorry, you’re right. Browsers seem to agree that


should produce


which is half-right. The article’s discussion is all about the internal representation of Unicode strings, not the ability to read/write UTF-8 (which is, I guess, a priori independent).

Posted by Jacques Distler at

