REXML and Mangled Text
Rick Blommers: ReXML seems to escape items very nicely when setting values. But it doesn’t unescape the values with REXML::Document.new( … )
A bare minimum amount of functionality that one would expect from an XML parsing library is the ability to round-trip data. If you parse a document and immediately reserialize the result, you would expect to get the original back. If you create a DOM, serialize it, and parse the results, you would expect to get the original back. The version of REXML that comes with Ruby 1.8.4 gives you the latter. The version of REXML that comes with Ruby 1.8.6 gives you the former. Neither gives you both.
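The two round trips described above can be checked with a few lines of Ruby. This is only a minimal sketch: the sample string and the variable names are my own, and the results depend on which REXML version is installed, so each direction is recorded as a boolean rather than assumed.

```ruby
require 'rexml/document'

# Direction 1: parse a document and immediately reserialize it.
source = '<a>1 &lt; 2 &amp;&amp; 3 &gt; 2</a>'   # illustrative sample, not from the test case
reserialized = REXML::Document.new(source).to_s
parse_then_serialize_ok = (reserialized == source)

# Direction 2: build a DOM, serialize it, and parse the result.
plain_text = '1 < 2 && 3 > 2'
built = REXML::Document.new
built.add_element('a').text = plain_text
reparsed = REXML::Document.new(built.to_s)
build_then_parse_ok = (reparsed.root.text == plain_text)

puts "parse, then serialize: #{parse_then_serialize_ok}"
puts "build, serialize, then parse: #{build_then_parse_ok}"
```

A library that round-trips correctly reports true for both directions.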
This test case can be used to explore this situation. When you run it under Ruby 1.8.6 and pass nots (no test serializer) as a command line argument, everything passes. If you pass notp (no test parser) instead, you will see 30 failures. Running with mp notp (monkey patch and no test parser), everything passes, but running with mp nots produces 30 failures.
The root problem is in text.rb. Line 147 will “normalize” (entity encode) @string in response to calls to to_s. Line 174 will “unnormalize” (entity decode) @string in response to calls to value.
The key question is: is @string already entity encoded (in which case normalize will double encode it)? Or is @string already entity decoded (in which case value will double decode it)? The answer can be found in @raw. If it is set, the attribute is assumed to be entity encoded, in which case to_s simply returns it. If it is not set (the default), you would assume that the reverse would be true, but no such short circuiting exists in value. Additionally, the keyword return is missing in the first line of value, eliminating a potential optimization.
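To make the symmetry concrete, here is a sketch of what short circuiting in both directions could look like. SketchText is a hypothetical, heavily simplified stand-in that handles only three entities. It is not REXML's Text class and not a proposed patch. It only illustrates the idea that a raw flag should decide which of to_s and value gets to return the stored string untouched.

```ruby
# Hypothetical stand-in for a text node; not REXML::Text.
class SketchText
  ENCODE = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze
  DECODE = ENCODE.invert.freeze

  # raw = true means +string+ is already entity encoded (e.g. it came from a parser).
  def initialize(string, raw = false)
    @string = string
    @raw = raw
  end

  # Serialized form: already-encoded strings are returned as they are.
  def to_s
    return @string if @raw
    @string.gsub(/[&<>]/) { |char| ENCODE[char] }
  end

  # Decoded form: strings that were never encoded are returned as they are.
  def value
    return @string unless @raw
    @string.gsub(/&(?:amp|lt|gt);/) { |entity| DECODE[entity] }
  end
end

parsed = SketchText.new('a &amp;lt; b', true)   # as if read from a document
built  = SketchText.new('a &lt; b')             # as if set by application code

puts parsed.to_s, parsed.value
puts built.to_s, built.value
```

Both objects represent the same text, so they agree on to_s and on value regardless of which form was stored.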
There are other issues with the code. For example, try REXML::Text.unnormalize('&') (which works as expected) and REXML::Text.unnormalize('&&') (which doesn’t).
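One common way an entity decoder goes wrong is by decoding one entity type at a time, so that the output of an earlier pass is decoded again by a later one. Whether that is what happens inside unnormalize is an assumption on my part. The helpers below are hypothetical and only contrast the two strategies.

```ruby
# Hypothetical helpers for illustration; neither is REXML's implementation.
ENTITIES = { '&amp;' => '&', '&lt;' => '<', '&gt;' => '>' }.freeze

# Decodes one entity type per pass, so an earlier pass can create
# new entities that a later pass then decodes a second time.
def decode_sequentially(text)
  ENTITIES.reduce(text) { |acc, (entity, char)| acc.gsub(entity, char) }
end

# Decodes every entity in a single left-to-right pass over the input.
def decode_single_pass(text)
  text.gsub(/&(?:amp|lt|gt);/) { |entity| ENTITIES[entity] }
end

encoded = '&amp;lt;'   # the encoded form of the literal text "&lt;"
sequential = decode_sequentially(encoded)
single = decode_single_pass(encoded)

puts "sequential: #{sequential}"
puts "single pass: #{single}"
```

The sequential version decodes the sample twice and loses the ampersand, while the single pass stops after one level of decoding.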
“when the world ends, the only things left will be cockroaches, rats, Keith Richards, and mangled text that has been escaped one-too-many or one-too-few times” — Dave Walker
The two things I have yet to find are where I can SVN checkout the latest code and how to run the existing set of tests. I would like to submit new tests which expose the problems I have found so far, and patches to correct these issues. Ideally in time for 3.1.8.
Pointers appreciated.
svn co http://www.germane-software.com/repos/rexml/trunk/ works for me.
Thanks!
There’s a bin/suite.rb to run the test suite.
346 tests, 1225 assertions, 10 failures, 8 errors
:-(
Posted by Sam Ruby at
Running it with simply bin/suite.rb (as opposed to ruby bin/suite.rb) reduces this down to one error: No such file or directory - test/xml/ticket_110_utf16.xml. I can work with that.
Posted by Sam Ruby at
Mankind’s ability to write software is far in advance of mankind’s ability to determine what it does (or does not) do. Scary, but true.
Also, from a commercial legal point of view, it is a question of ‘Do I have the right to distribute this software’ (for any price or for none). Consideration of whether it solves any problem for a client ... or indeed whether it solves any problem at all ... comes a distant second.
Posted by Chris Ward at
Sadly, undeniably true
... at least the last bit: “when the world ends, the only things left will be cockroaches, rats, Keith Richards, and mangled text that has been escaped one-too-many or one-too-few times” — Dave Walker (found this little gem via Sam Ruby)...
Excerpt from Steven Wittens - Acko.net blogs at
svn co http://www.germane-software.com/repos/rexml/trunk/ works for me. There’s a bin/suite.rb to run the test suite.
Posted by Arien at