3 + 1 = 2
I’ve got portions of HTML5lib working on Ruby 1.9, enough to pass Mars's unit tests. My initial reaction to Ruby 1.9’s Unicode support isn’t favorable. I definitely like Python 3K's Unicode support better. This feels closer to Python 2.5. In fact, I think I prefer Ruby 1.8’s non-support for Unicode over Ruby 1.9’s “support”.
The problem is one that is all too familiar to Python programmers. You can have a fully unit tested library, have somebody pass you a bad string, and you will fall over. An example that fails with Ruby 1.9:
[0x2639].pack('U') + "\u2639"
The error that is produced is ArgumentError: character encodings differ. The left hand side specifies packing as UTF-8. The right hand side is expressed as Unicode, which Ruby represents as #<Encoding:UTF-8>. The problem is that the left hand side is actually stored as #<Encoding:ASCII-8BIT>, which is a misnomer. In many ways this mirrors Python 2.x’s <type 'str'> vs <type 'unicode'>, except that with Ruby 1.9 both Strings are the same type.
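A minimal sketch of the same kind of mismatch, assuming a later Ruby (2.0 or newer) where String#b exists and where pack('U') may already hand back a UTF-8 string, so the one-liner above might not fail as written. The variable names are illustrative only.

```ruby
# Two strings holding identical bytes but carrying different encoding tags.
utf8   = "\u2639"      # tagged UTF-8
binary = "\u2639".b    # copy of the same three bytes, tagged ASCII-8BIT

same_bytes = utf8.bytes == binary.bytes

# Concatenation raises because both sides contain non-ASCII bytes
# and the encoding tags disagree.
error = nil
begin
  binary + utf8
rescue Encoding::CompatibilityError => e
  error = e
end

# Retagging a copy of the binary string as UTF-8 lets the concatenation succeed.
fixed = binary.dup.force_encoding(Encoding::UTF_8) + utf8
```

Nothing about the class of either object warns you in advance; both are plain Strings, and only the encoding tag differs.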
Ruby 1.9 both mitigates and compounds the problem by providing a number of implicit conversions. Sometimes. Take a look at this code which produces this output. Specifically, look at rows 2 and 4, where two Strings, of the same type, encoding, length, and value produce different results when concatenated with UTF-8 strings. This type of magic destroys any confidence I have in unit testing as a viable strategy.
Update: no magic, just a bug.
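Separately from that bug, the "sometimes" in the implicit conversions can be sketched as follows. This assumes a later Ruby (2.0 or newer, for String#b), and the variable names are made up for illustration. Two strings with the same class and the same encoding tag behave differently depending on whether their bytes happen to be pure ASCII.

```ruby
utf8        = "\u2639"
ascii_bytes = "abc".b            # ASCII-only content, tagged ASCII-8BIT
high_bytes  = "\xE2\x98\xB9".b   # non-ASCII content, tagged ASCII-8BIT

# Encoding.compatible? returns the encoding a concatenation would use,
# or nil when the two strings cannot be combined.
ascii_ok = Encoding.compatible?(ascii_bytes, utf8)
high_ok  = Encoding.compatible?(high_bytes, utf8)

# The ASCII-only binary string is converted implicitly.
combined = ascii_bytes + utf8
```

A unit test that only ever feeds ASCII data through a code path will therefore pass, and the same code can still raise on the first non-ASCII input.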
My preference would be that #<Encoding:ASCII-8BIT> be abolished, in favor of #<Encoding:ASCII-7BIT> and a separate Bytes class. Generally, programmers would only see objects of class Bytes if they do “binary” file I/O, explicitly create constants of that type, or invoke methods such as String#bytes.
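To make the idea concrete, here is a rough sketch of what such a class might look like. Ruby has no Bytes class; everything below (the class name, its methods, the decode step) is hypothetical and only illustrates the separation being proposed, assuming a Ruby where String#force_encoding and String#valid_encoding? are available.

```ruby
# Hypothetical Bytes wrapper: raw octets that never mix with text implicitly.
class Bytes
  def initialize(data)
    # Keep a private, frozen, binary-tagged copy of the input bytes.
    @data = data.dup.force_encoding(Encoding::BINARY).freeze
  end

  # Bytes only concatenate with other Bytes, never with a String.
  def +(other)
    raise TypeError, "cannot mix Bytes with #{other.class}" unless other.is_a?(Bytes)
    Bytes.new(@data + other.data)
  end

  # Turning bytes into text requires naming an encoding explicitly.
  def decode(encoding)
    text = @data.dup.force_encoding(encoding)
    raise ArgumentError, "invalid byte sequence for #{encoding}" unless text.valid_encoding?
    text
  end

  def length
    @data.bytesize
  end

  attr_reader :data
  protected :data
end

raw  = Bytes.new("\u2639")
text = raw.decode("UTF-8")
```

The point of the design is that a mismatch becomes a type error at the boundary between bytes and text, rather than a failure that depends on which characters show up in the data.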
Other suggestions:
Array#pack('U') should behave like .map {|n| n.chr('UTF-8')}.join
If Ruby is going to support the specification of the default encoding on the command line, it should support Locale environment variables too.
If REXML is going to remain in the core libraries for Ruby, it should have a thorough audit. As XML is defined in terms of Unicode, REXML should never return binary strings. It also needs to be checked to prevent things like this from showing through:
rexml/element.rb:555: warning: Hash#index is deprecated; use Hash#key
Frankly, I’m a bit concerned that REXML is essentially unmaintained at this point: the mailing list is unresponsive, bug reports appear to be addressed only sporadically, and new releases all too often seem to produce regressions.
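Returning to the Array#pack('U') suggestion above, the equivalent expression can be sketched like this. The code point list is an arbitrary example, and 'U*' is used so that pack covers every element of the array rather than just the first.

```ruby
codepoints = [0x2639, 0x263A]

# Build the string one code point at a time, each piece tagged UTF-8.
joined = codepoints.map { |n| n.chr('UTF-8') }.join

# pack produces the same bytes; what encoding tag it attaches
# depends on the Ruby version.
packed = codepoints.pack('U*')

same_bytes = packed.bytes.to_a == joined.bytes.to_a
```

The map/join form always yields a string tagged UTF-8, which is the behavior being asked of pack('U').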