intertwingly

It’s just data

Nokogiri


Yesterday, I was looking to move some code that I have running on one machine to a server which has Ruby 1.8.6 installed.  I ran into yet another difference in the version of REXML bundled with that version of Ruby.  This time, instead of looking for a monkey patch, I looked for alternatives.

As of 2.3, Rails continues to default to REXML, but supports Nokogiri and LibXML as faster alternatives.  In addition to being faster, both are more spec compliant, and both can parse HTML (albeit not in a way that tracks HTML5).  Nokogiri also has a superior API (based on Hpricot) and supports CSS3 selectors.

Installation on Ubuntu and configuration of Rails:

sudo apt-get install ruby1.8-dev libxml2-dev libxslt1-dev
sudo gem install nokogiri
ActiveSupport::XmlMini.backend='Nokogiri'

Locating an element based on an id, using REXML:

node.elements['//*[@id="sidebar"]']

Alternatives using Nokogiri:

node.at('//*[@id="sidebar"]')
node.at('#sidebar')
node/'#sidebar'

Extracting an attribute given a node, using REXML:

node.attributes['href']

Using Nokogiri:

node['href']

Individually, the differences don’t seem major, but the effects are cumulative.  Which would you rather write:

REXML::Document.new('<a b="c"/>').elements['//a'].attributes['b']

or

Nokogiri::XML('<a b="c"/>').at('a')['b']

Suffice it to say that Nokogiri is now a part of my toolbox, and likely the first tool I will reach for when dealing with XML/XHTML/HTML content in ways beyond the ability of simple regular expressions.