intertwingly

It’s just data

Calling JAXP from Ruby


Demo:

ruby domencoding.rb test.xml
iso-8859-7

java domencoding test.xml
iso-8859-7

./domencoding test.xml
iso-8859-7

cd jaxp; ruby -r jaxp -e 'include Jaxp; puts parse("../test.xml")'
iso-8859-7

Backstory:

The next step would be to make the API mirror the remainder of the Nokogiri/Hpricot API.  Care will have to be taken to make sure that the Java objects are “pinned” at the appropriate times, that they don’t leak, and that no cycles are created in the object graph, but all that should be doable.

Future work could be to convert CNI to JNI, or to make Henri’s C++ translator produce output that is independent of Mozilla libraries (and potentially make use of Ruby libraries for things like heap management).  Beyond that, this exercise could be repeated for languages like Python and PHP.

Why is this a big deal?  Well, Validator.nu has an HtmlDocumentBuilder that complies with HTML 5.  Furthermore, JAXP has full support for XPath and Nokogiri has the ability to convert CSS into XPath:

$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'nokogiri'
=> true
irb(main):003:0> cssparser = Nokogiri::CSS::Parser.new
=> #<Nokogiri::CSS::Parser:0x7f576c31cb10 @namespaces={}>
irb(main):004:0> cssparser.xpath_for('p.foo a[href$=".pdf"]')
=> ["//p[contains(concat(' ', @class, ' '), ' foo ')]//a[substring(@href, string-length(@href) - string-length(\".pdf\") + 1, string-length(\".pdf\")) = \".pdf\"]"]
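The generated expression leans on two XPath 1.0 idioms: padding @class with spaces so that a class name matches only as a whole token, and slicing off the tail of @href because XPath 1.0 has no ends-with().  As a rough sketch of what each predicate computes, here are plain-Ruby equivalents; the helper names class_token? and ends_with? are made up for illustration and are not part of Nokogiri:

```ruby
# Plain-Ruby equivalents of the two XPath 1.0 idioms in the generated expression.
# Helper names are hypothetical; they are not part of Nokogiri or JAXP.

# contains(concat(' ', @class, ' '), ' foo ')
# Padding both sides with spaces makes "foo" match only as a whole
# space-separated token, never as part of a longer class name.
def class_token?(class_attr, name)
  " #{class_attr} ".include?(" #{name} ")
end

# substring(@href, string-length(@href) - string-length(s) + 1, string-length(s)) = s
# The last string-length(s) characters are sliced off and compared with s.
def ends_with?(value, suffix)
  return false if value.length < suffix.length
  value[value.length - suffix.length, suffix.length] == suffix
end
```

With these, class_token?("note foo", "foo") is true while class_token?("foobar", "foo") is false, and ends_with?("report.pdf", ".pdf") is true while ends_with?("report.pdf.html", ".pdf") is false.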

This may seem like going the long way around, but my intuition is that the expensive part will be transferring data across the language boundary (especially strings).  Keeping the DOM entirely on the Java side, probing it via CSS selectors and/or XPath expressions, and retrieving only the specific nodes across the Java/Ruby boundary for further processing should minimize this.  And the end result should be able to pass all of the Nokogiri and Validator.nu tests.
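To make that intuition concrete, here is a toy model in plain Ruby.  Nothing in it touches Java: FakeJavaDom is a made-up stand-in for a DOM held on the other side of the boundary, and it simply counts each string it hands back as one crossing.  It measures nothing real; it only shows why filtering before crossing should move less data:

```ruby
# Toy model only: FakeJavaDom is a hypothetical stand-in for a DOM that
# lives on the Java side. Each string it returns counts as one crossing
# of the language boundary (the query string going in is not counted).
class FakeJavaDom
  attr_reader :crossings

  def initialize(hrefs)
    @hrefs = hrefs      # data that stays "on the Java side"
    @crossings = 0
  end

  # Strategy A: hand every href to Ruby and let Ruby do the filtering.
  def all_hrefs
    @crossings += @hrefs.size
    @hrefs.dup
  end

  # Strategy B: filter on the "Java side"; only the matches cross over.
  def hrefs_ending_with(suffix)
    matches = @hrefs.select { |h| h.end_with?(suffix) }
    @crossings += matches.size
    matches
  end
end

hrefs = ["a.pdf", "b.html", "c.html", "d.pdf", "e.png"]

dom_a = FakeJavaDom.new(hrefs)
pdfs_a = dom_a.all_hrefs.select { |h| h.end_with?(".pdf") }

dom_b = FakeJavaDom.new(hrefs)
pdfs_b = dom_b.hrefs_ending_with(".pdf")

puts "filter in Ruby: #{dom_a.crossings} strings crossed"
puts "filter in Java: #{dom_b.crossings} strings crossed"
```

Both strategies find the same two links, but the first moves all five strings across the boundary and the second moves only the two that matched.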

Best-of-breed Ruby API with a best-of-breed HTML parser.  What’s not to like?