Calling JAXP from Ruby

2009-06-18T00:38:11Z

Demo:

$ ruby domencoding.rb test.xml
iso-8859-7

$ java domencoding test.xml
iso-8859-7

$ ./domencoding test.xml
iso-8859-7

$ cd jaxp; ruby -r jaxp -e 'include Jaxp; puts parse("../test.xml")'
iso-8859-7

Backstory:

First, a very simple program in Ruby using Nokogiri, an excellent XML parser (based on libxml2, with the API of Hpricot).
Next, an equivalent program in Java using JAXP.
Then, the same program in C++ using JAXP via CNI. Oddly, I get a NullPointerException if I call parse with a file name, but it works if I first construct a FileInputStream and pass that. For now, I’ve left in the code that catches the Java exception and prints the stack trace, and decided to press on.
Finally, I wrap the C++ code into a module that can be called by Ruby. Strings are a bit of a pain, but at least that logic can be refactored into a separate function and reused.
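
As a rough illustration of the Java step above, here is a minimal sketch of reading a document's declared encoding through JAXP. It is not the domencoding program from the demo: the class name DomEncodingSketch and the method declaredEncoding are made up for illustration, and it assumes the value of interest is the encoding named in the XML declaration, as reported by the DOM Level 3 getXmlEncoding method. It hands parse an InputStream rather than a file name, in the spirit of the FileInputStream workaround described above.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch; names are illustrative, not taken from the original program.
class DomEncodingSketch {
    // Parse the stream with JAXP and return the encoding named in the
    // XML declaration (null if the declaration does not name one).
    static String declaredEncoding(InputStream in) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            return doc.getXmlEncoding();
        } catch (Exception e) {
            // Wrap checked parser/IO exceptions so callers stay simple.
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Open the file ourselves and pass the stream to the parser.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(declaredEncoding(in));
        }
    }
}
```

Run against a file whose XML declaration names iso-8859-7, this should print that encoding name, matching the demo output.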

The next step would be to make the API mirror the remainder of the Nokogiri/Hpricot API. Care will have to be taken to make sure that the Java objects are “pinned” at the appropriate times, that they don’t leak, and that no cycles are created in the object graph, but all that should be doable.

Future work could be to convert CNI to JNI, or to make Henri’s C++ translator produce output that is independent of Mozilla libraries (and potentially make use of Ruby libraries for things like heap management). That, and repeating this exercise for languages like Python and PHP.

Why is this a big deal? Well, validator.nu has an HtmlDocumentBuilder that complies with HTML 5. Furthermore, JAXP has full support for XPath, and Nokogiri has the ability to convert CSS into XPath:

$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'nokogiri'
=> true
irb(main):003:0> cssparser = Nokogiri::CSS::Parser.new
=> #<Nokogiri::CSS::Parser:0x7f576c31cb10 @namespaces={}>
irb(main):004:0> cssparser.xpath_for('p.foo a[href$=".pdf"]')
=> ["//p[contains(concat(' ', @class, ' '), ' foo ')]//a[substring(@href, string-length(@href) - string-length(\".pdf\") + 1, string-length(\".pdf\")) = \".pdf\"]"]

This may seem like going the long way around, but my intuition is that the expensive part will be transferring data across the language boundary (especially strings). Keeping the DOM entirely on the Java side, probing it via CSS selectors and/or XPath expressions, and retrieving only the specific nodes needed across the Java/Ruby boundary for further processing should minimize this. And the end result should be able to pass all of the Nokogiri and Validator.nu tests.
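
To make the idea concrete, here is a small sketch of the Java half of that arrangement: the XPath expression that Nokogiri generated above is evaluated against a DOM that stays on the Java side, and only the matching href strings are handed back. The class XPathProbeSketch, its hrefs method, and the sample markup in the usage note are assumptions made for illustration. It also uses the stock JAXP parser on well-formed XML rather than the validator.nu HtmlDocumentBuilder, whose setup is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; names are illustrative only.
class XPathProbeSketch {
    // The XPath produced by Nokogiri for the selector p.foo a[href$=".pdf"].
    static final String PDF_LINKS =
        "//p[contains(concat(' ', @class, ' '), ' foo ')]"
        + "//a[substring(@href, string-length(@href) - string-length(\".pdf\") + 1, "
        + "string-length(\".pdf\")) = \".pdf\"]";

    // Build the DOM in Java, evaluate the XPath there, and return only the
    // href values of matching elements (the data that would cross the boundary).
    static List<String> hrefs(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(((Element) nodes.item(i)).getAttribute("href"));
            }
            return result;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions so callers stay simple.
            throw new IllegalStateException("XPath probe failed", e);
        }
    }
}
```

For example, given a p element with class "foo bar" containing links to a.pdf and b.html, plus a second p without the foo class linking to c.pdf, hrefs should return only a.pdf. The rest of the tree never leaves the Java side.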

Best-of-breed Ruby API with best-of-breed HTML parser. What’s not to like?