It’s just data

3 + 1 = 2

I’ve got portions of html5lib working on Ruby 1.9, enough to pass Mars's unit tests.  My initial reaction to Ruby 1.9’s Unicode support isn’t favorable.  I definitely like Python 3K's Unicode support better.  This feels closer to Python 2.5.  In fact, I think I prefer Ruby 1.8’s non-support for Unicode over Ruby 1.9’s “support”.

The problem is one that is all too familiar to Python programmers.  You can have a fully unit tested library, and somebody can pass you a bad string, and you will fall over.  An example that fails with Ruby 1.9:

[0x2639].pack('U') + "\u2639"

The error produced is ArgumentError: character encodings differ.  The left hand side packs the codepoint as UTF-8 bytes.  The right hand side is a Unicode literal, which Ruby tags as #<Encoding:UTF-8>.  The problem is that the left hand side is actually tagged #<Encoding:ASCII-8BIT>, which is a misnomer: it really means “binary”.  In many ways this mirrors Python 2.x’s <type 'str'> vs <type 'unicode'> split, except that in Ruby 1.9 both strings are instances of the same String class.
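The mismatch can still be recreated on released Rubies, with two caveats I should flag: pack('U') now tags its result UTF-8 (the ASCII-8BIT tagging described above was the 1.9 snapshot’s behavior), so the sketch below builds the binary string explicitly with String#b, and the error class is now Encoding::CompatibilityError rather than ArgumentError:

```ruby
# Same three bytes, two different encoding tags.
bytes = "\u2639".b     # "\xE2\x98\xB9" tagged #<Encoding:ASCII-8BIT>
utf8  = "\u2639"       # same bytes,    tagged #<Encoding:UTF-8>

bytes.encoding.name    # => "ASCII-8BIT"
utf8.encoding.name     # => "UTF-8"
bytes == utf8          # => false -- identical bytes, incompatible tags

begin
  bytes + utf8         # the concatenation from the example above
rescue Encoding::CompatibilityError
  # released Rubies raise here, where the 1.9 snapshot raised ArgumentError
end
```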

Ruby 1.9 both mitigates and compounds the problem by providing a number of implicit conversions.  Sometimes.  Take a look at this code, which produces this output.  Specifically, look at rows 2 and 4, where two Strings of the same type, encoding, length, and value produce different results when concatenated with UTF-8 strings.  This kind of magic destroys any confidence I have in unit testing as a viable strategy.
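For what it’s worth, in released Rubies the “sometimes” settled into one rule: an ASCII-8BIT string participates in UTF-8 concatenation only when every byte is 7-bit (ascii_only?).  A sketch of that rule as I understand it, not of the snapshot behavior the table above was generated from:

```ruby
pure  = "abc".b             # ASCII-8BIT, every byte below 0x80
mixed = "\xE2\x98\xB9".b    # ASCII-8BIT, contains high bytes
smile = "\u2639"            # UTF-8

pure.ascii_only?            # => true
mixed.ascii_only?           # => false

# 7-bit binary data is implicitly promoted to the other operand's encoding.
(pure + smile).encoding     # => #<Encoding:UTF-8>

begin
  mixed + smile             # high bytes: no implicit conversion is attempted
rescue Encoding::CompatibilityError
end
```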

Update: no magic, just a bug.

My preference would be that #<Encoding:ASCII-8BIT> be abolished, in favor of #<Encoding:ASCII-7BIT> and a separate Bytes class.  Generally, programmers would only see objects of class Bytes if they do “binary” file I/O, explicitly create constants of that type, or invoke methods such as String#bytes.
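To make the suggestion concrete, here is a hypothetical sketch of such a class.  Ruby has no Bytes class; the name and the decode/size API below are invented for illustration:

```ruby
# Hypothetical: a container for raw bytes with no character semantics.
# Decoding to a String is always an explicit step, never an implicit tag.
class Bytes
  def initialize(str)
    @data = str.dup.force_encoding(Encoding::ASCII_8BIT)
  end

  # Callers must name the encoding they believe the bytes to be in.
  def decode(encoding = Encoding::UTF_8)
    s = @data.dup.force_encoding(encoding)
    raise ArgumentError, "not valid #{encoding}" unless s.valid_encoding?
    s
  end

  # Length is unambiguous: it is always a byte count.
  def size
    @data.bytesize
  end
end

frown = Bytes.new("\u2639")   # the three UTF-8 bytes of U+2639
frown.size                    # => 3
frown.decode.encoding         # => #<Encoding:UTF-8>
```

The point of the design is that nothing here happens by coincidence of an encoding tag: mixing Bytes with a String would be a NoMethodError, not a sometimes-works concatenation.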

Other suggestions: