It’s just data

REXML and Mangled Text

Rick Blommers: ReXML seems to escape items very nicely when setting values.  But it doesn’t unescape the values with REXML::Document.new( … )

A bare minimum amount of functionality that one would expect from an XML parsing library is the ability to round-trip data.  If you parse a document and immediately reserialize the result, you would expect to get the original back.  If you create a DOM, serialize it, and parse the results, you would expect to get the original back.  The version of REXML that comes with Ruby 1.8.4 gives you the latter.  The version of REXML that comes with Ruby 1.8.6 gives you the former.  Neither gives you both.
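The two round trips can be sketched as a pair of small checks. This is a minimal sketch, not the test case discussed in this post: the helper names and sample strings are my own, and the results will depend on which REXML version is installed.

```ruby
require 'rexml/document'

# Parse a document and immediately reserialize it:
# the output should be identical to the input.
def parse_then_serialize_ok?(xml)
  REXML::Document.new(xml).to_s == xml
end

# Build a DOM, serialize it, and parse the result:
# the text that went in should be the text that comes out.
def build_then_parse_ok?(text)
  element = REXML::Element.new('a')
  element.text = text
  REXML::Document.new(element.to_s).root.text == text
end

puts parse_then_serialize_ok?('<a>&lt;b&gt; &amp; c</a>')
puts build_then_parse_ok?('<b> & c')
```

A library that round-trips correctly should report true for both checks.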

This test case can be used to explore this situation.  When you run it using Ruby 1.8.6 and pass nots (no test serializer) as a command line argument, you will see that everything passes.  If you pass notp (no test parser) instead, you will see 30 failures.  Run it with mp notp (monkey patch and no test parser) and everything passes, but run it with mp nots and you will see 30 failures.

The root problem is in text.rb.  Line 147 will “normalize” (entity encode) @string in response to calls to to_s.  Line 174 will “unnormalize” (entity decode) @string in response to calls to value.
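To see what the two operations do in isolation, the class methods REXML::Text.normalize and REXML::Text.unnormalize can be called directly. This is a rough sketch run against whatever REXML is installed, not the 1.8.6 source, and the sample string is made up.

```ruby
require 'rexml/document'

plain = 'a < b & c'   # made-up sample text

encoded = REXML::Text.normalize(plain)      # entity encode
decoded = REXML::Text.unnormalize(encoded)  # entity decode

puts encoded
puts decoded == plain

# Encoding text that is already encoded escapes its ampersands a second time.
puts REXML::Text.normalize(encoded)
```

The last line is the hazard: either operation is only safe if you know which form the string is currently in.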

The key question is: is @string already entity encoded (in which case normalize will double encode it)?  Or is @string already entity decoded (in which case value will double decode it)?  The answer can be found in @raw.  If it is set, @string is assumed to be entity encoded, in which case to_s simply returns it.  If it is not set (the default), you would assume that the reverse would be true, but no such short circuiting exists in value.  Additionally, the keyword return is missing in the first line of value, eliminating a potential optimization.
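One way to picture the symmetric short circuit is a toy class in which a single flag decides which of the two methods has work to do. This is a hypothetical stand-in, not REXML's code or a proposed patch: the class name SketchText and its three-entity table are assumptions made for illustration.

```ruby
# Hypothetical stand-in for REXML::Text; not the library's actual code.
class SketchText
  ENCODE = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze
  DECODE = ENCODE.invert.freeze

  # raw == true means the string passed in is already entity encoded.
  def initialize(string, raw = false)
    @string = string
    @raw = raw
  end

  # Serialized form: encode only when @string is not already encoded.
  def to_s
    return @string if @raw
    @string.gsub(/[&<>]/, ENCODE)
  end

  # Logical value: decode only when @string is encoded.
  def value
    return @string unless @raw
    @string.gsub(/&(?:amp|lt|gt);/, DECODE)
  end
end

puts SketchText.new('a < b').to_s
puts SketchText.new('&amp;lt;', true).value
```

With the flag consulted in both directions, neither method can encode or decode a string that is already in the requested form.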

There are other issues with the code.  For example, try REXML::Text.unnormalize('&') (which works as expected) and REXML::Text.unnormalize('&&')  (which doesn’t).

“when the world ends, the only things left will be cockroaches, rats, Keith Richards, and mangled text that has been escaped one-too-many or one-too-few times” — Dave Walker

The two things I have yet to find are where I can SVN checkout the latest code, and how to run the existing set of tests.  I would like to submit new tests which expose the problems I have found so far, and patches to correct these issues.  Ideally in time for 3.1.8.

Pointers appreciated.


svn co http://www.germane-software.com/repos/rexml/trunk/ works for me.

There’s a bin/suite.rb to run the test suite.

Posted by Arien at

svn co http://www.germane-software.com/repos/rexml/trunk/ works for me.

Thanks!

There’s a bin/suite.rb to run the test suite.

346 tests, 1225 assertions, 10 failures, 8 errors

:-(

Posted by Sam Ruby at

Running it with simply bin/suite.rb (as opposed to ruby bin/suite.rb) reduces this down to one error: No such file or directory - test/xml/ticket_110_utf16.xml.  I can work with that.

Posted by Sam Ruby at

Mankind’s ability to write software is far in advance of mankind’s ability to determine what it does (or does not) do. Scary, but true.

Also, from a commercial legal point of view, it is a question of ‘Do I have the right to distribute this software’ (for any price or for none). Consideration of whether it solves any problem for a client ... or indeed whether it solves any problem at all ... comes a distant second.

Posted by Chris Ward at

Living In a Fool's Paradise

Why I hate REXML....

Excerpt from Musings at

Sadly, undeniably true

... at least the last bit: “when the world ends, the only things left will be cockroaches, rats, Keith Richards, and mangled text that has been escaped one-too-many or one-too-few times” — Dave Walker (found this little gem via Sam Ruby )...

Excerpt from Steven Wittens - Acko.net blogs at

masklinn on PERL is dead. Long live Perl

Which ones? I’m sure you can find numerous others, but things like [link] (note that Sam hoped for inclusion in 3.1.8, we’re still at 3.1.7 18 months after 3.1.8 was supposed to be

...

Excerpt from programming: what's new online at
