intertwingly

It’s just data

REXML on Expat


Tim Bray: Ruby has a kind of stand-offish attitude towards two of my favorite pieces of infrastructure, XML and Unicode. REXML provides a nice API, but, as Sam Ruby discovered, has big-enough holes that you can’t point it at Arbitrary Internet XML and hope for good results.

I talked to Tim about this at OSCON, and took a look at it on the plane ride back.  It also gives me an opportunity to demonstrate something I only talked about previously: namely converting a SAX parser to a pull parser via continuations.

Preparation

First, one needs to install the Ruby interface to Expat.  With Ubuntu Dapper:

sudo apt-get install libxml-parser-ruby1.8

With other operating systems, it is somewhat harder.

Proof of Concept

The implementation is short.  It calls the Expat parser and, for each event it receives, reformats the data into the structure that REXML::Parsers::BaseParser produces.
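To make the reformatting step concrete, here is a minimal sketch, not the actual implementation. It assumes the SAX-style callback delivers a (type, name, data) triple, and the type tags :start_elem, :end_elem and :cdata are hypothetical stand-ins for whatever constants the Expat binding defines. The output arrays follow the shape of REXML's pull events as I understand them.

```ruby
# Convert one SAX-style callback into a REXML-style pull event.
# The type tags used here (:start_elem, :end_elem, :cdata) are hypothetical
# stand-ins, not the constants of any real Expat binding.
def to_rexml_event(type, name, data)
  case type
  when :start_elem then [:start_element, name, data || {}]  # data: attribute hash
  when :end_elem   then [:end_element, name]
  when :cdata      then [:text, data]                       # data: character data
  else raise ArgumentError, "unhandled event type: #{type.inspect}"
  end
end
```

A real adapter would need more cases (comments, processing instructions, the XML declaration, DTD events); only the three most common ones are shown.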

Two methods, push and pull, handle the context switches, and they are very simple: a single call to callcc (saving the current stack) and a single invocation of the saved continuation's call method to resume execution on the other stack.  Priming the pump involves a single additional usage of callcc coupled with a return statement, forking the stack.
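The push/pull handoff can be sketched roughly as follows. This is an illustration of the technique under my own assumptions, not the code from the proof of concept: the class name PullAdapter and its instance variables are hypothetical, and priming is folded into the first pull rather than done with a separate callcc and return.

```ruby
require 'continuation'  # callcc lives here since Ruby 1.9

# Hypothetical adapter: the producer block pushes events SAX-style,
# while the consumer asks for them one at a time with pull.
class PullAdapter
  def initialize(&producer)
    @producer   = producer
    @producer_k = nil    # where to resume the producer
    @consumer_k = nil    # where to resume the consumer
    @event      = nil
    @done       = false
  end

  # Called by the producer: save its stack, then jump to the consumer.
  def push(event)
    callcc do |k|
      @producer_k = k
      @event = event
      @consumer_k.call
    end
  end

  # Called by the consumer: save its stack, then jump to the producer.
  def pull
    return nil if @done
    callcc do |k|
      @consumer_k = k
      if @producer_k
        @producer_k.call          # resume inside the last push
      else
        @producer.call(self)      # first pull: start the producer
        @done = true              # producer ran to completion
        @event = nil
        @consumer_k.call          # return to the most recent pull
      end
    end
    @event
  end
end

adapter = PullAdapter.new do |sink|
  sink.push([:start_element, "a", {}])
  sink.push([:text, "hi"])
  sink.push([:end_element, "a"])
end

while (event = adapter.pull)
  p event
end
```

Note that when the producer finishes, control is sitting on the stack of the very first pull, so the sketch jumps explicitly to the latest consumer continuation instead of returning normally. Current Ruby versions print a warning that callcc is obsolete and suggest Fiber instead.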

Included is a small but representative set of test cases.  Some ensure that the events produced by this code exactly match the ones produced by the BaseParser.  Others verify that a specific event is produced given a specific input.

I’ve also produced a simple demo application.  A real application would understand Atom’s processing model.

Future

First, I must stress that this is just a proof of concept at this point.  REXML’s base parser, for example, doesn’t resolve entity references.  If other parts of REXML compensate for this, the results would be incorrect with Expat, which resolves them itself.  A complete audit and test suite are in order, bringing the semantics of REXML’s base parser in line with the other XML parsers.

Ideally, the code to allow other parsers to be registered would also be accepted into the REXML code base.  Additionally, the parameter which allows one to select which parser to use would need to be propagated up into interfaces like Document.new.

For Expat, there is more work that needs to be done.  The Expat-Ruby interface does not provide enough information to fully construct the corresponding DTD events.  A full pull parser interface also includes methods such as unshift and peek.  Assuming the REXML registration code is accepted, and given the popularity of REXML in the Ruby community, all of this could be coded in C and included with the Expat-Ruby module itself.
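Until something like that exists in C, unshift and peek could be layered over any object that offers pull. The wrapper below is a rough sketch under that assumption; BufferedPuller and ArraySource are hypothetical names, and ArraySource only stands in for a real parser so the example runs on its own.

```ruby
# Hypothetical wrapper adding unshift and peek to anything with a pull method
# that returns the next event, or nil once the input is exhausted.
class BufferedPuller
  def initialize(source)
    @source = source
    @buffer = []        # events read ahead or pushed back
  end

  # Next event: drain the buffer before asking the source.
  def pull
    @buffer.empty? ? @source.pull : @buffer.shift
  end

  # Push an event back so the next pull returns it.
  def unshift(event)
    @buffer.unshift(event)
    self
  end

  # Look ahead without consuming; depth 0 is the next event.
  def peek(depth = 0)
    while @buffer.size <= depth
      event = @source.pull
      return nil if event.nil?
      @buffer.push(event)
    end
    @buffer[depth]
  end
end

# Stand-in for a real parser, used only to make the sketch runnable.
class ArraySource
  def initialize(events)
    @events = events.dup
  end

  def pull
    @events.shift
  end
end

parser = BufferedPuller.new(ArraySource.new([
  [:start_element, "a", {}], [:text, "hi"], [:end_element, "a"]
]))
p parser.peek(1)   # look past the start tag
p parser.pull
```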

Similar efforts could also be made for other parsers, such as libxml2 and xerces-c.  One could then pick the parser one desires based on considerations such as performance, functionality, or portability.