It’s just data

Benchmarking XPath enabled libraries

test1 REXML Document.new 31.961s
test2 REXML Marshal::load 10.325s
test3 libxml2 parseFile 1.049s

It might be worth noting that REXML is 100% pure Ruby code, while the Python code is calling a natively compiled library through wrapper code.  It would seem to me that a 15x speed difference between the first & third test isn't too shabby for interpreted code...

Posted by Robert Hahn at

It should also be noted that libxml2 has quite usable Ruby wrappers: http://raa.ruby-lang.org/list.rhtml?name=libxml

Posted by Avdi at

Not too shabby?  Agreed!

Am I willing to switch? Unfortunately, not yet.

Note: file.read for these same files takes 0.136 and 0.156 in Python and Ruby, respectively.
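A baseline like the file.read figure above can be taken with a few lines of Ruby. This is only a minimal sketch, not the script that produced those numbers: the helper name time_file_reads and the sample files are assumptions made for illustration.

```ruby
require 'tmpdir'

# Time how long it takes to read every file in a directory into memory.
# Returns [elapsed_seconds, total_bytes].
def time_file_reads(dir)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  bytes = Dir.glob(File.join(dir, '*')).sum { |path| File.read(path).bytesize }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [elapsed, bytes]
end

# Hypothetical sample data: three small files stand in for the real entries.
Dir.mktmpdir do |dir|
  3.times { |i| File.write(File.join(dir, "entry#{i}.xml"), '<entry/>' * 100) }
  elapsed, bytes = time_file_reads(dir)
  puts format('read %d bytes in %.6fs', bytes, elapsed)
end
```

Subtracting a baseline like this from a parser's elapsed time gives a rough idea of how much of the total is I/O rather than parsing.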

Posted by Sam Ruby at

Sam: my curiosity is piqued.  Your last two posts about Ruby [the language] have been without editorial comment, so it's difficult to determine the context from which you're exploring this.  Now it seems that you're evaluating Ruby as a possible language to do work in.

That's cool, and that you're seemingly going about it in a methodical way is to be commended.

So, would you care to elaborate on what you're doing? Are you evaluating Ruby as a dev language for your toolbox?  What prompted it?  And, since you volunteered the information that you're not willing to switch, why not, and what are you looking for that would make switching worthwhile?

I'm not much of an evangelist. I find Ruby's constructs more intuitive than Perl, and I like that language more, too, but I do most of my dev work in Perl for now.  Inertia is a powerful force. :P

Posted by Robert Hahn at

http://www.google.com/search?q=ruby

Isn't it obvious?  Sam is trying to claim the #1 spot on Google for his name.  Since the Ruby language is his direct competitor, he's co-opting the competition in hopes of becoming (in Googlebot's eyes) the #1 source of all things Ruby.

Posted by Mark at

Sam, I suggest doing a series of posts on the W3C's "Ruby Annotation" standard: http://www.w3.org/TR/ruby/, in order to cement your Google reign.  Followed, perhaps, by a Ruby Tuesday's restaurant review...

Posted by Avdi at

Avdi: cool!  The libxml2 bindings page appears to be out of date.  The Ruby 0.3.4 bindings appear to be slightly out of sync with the libxml2 2.5.11 release (xmlTreeIndentString appears to be history), but after commenting out a few lines I had the module built.

test3 now runs in 1.049s.  Impressive.

Robert: if I knew where I was planning on ending up, getting there wouldn't be half as much fun.  ;-)

After I started experimenting with XPath, every blog entry and comment is kept in two forms: a blosxom compatible format, and an XML format (Atom with xhtml content).  I'm exploring converting over completely to an XML format, and heard good things about REXML.  This led to a detour through learning the Ruby language.

Posted by Sam Ruby at

who did these benchmarks?

Posted by Tom Severoc at

Sam:  if you are exploring REXML, then I would like to recommend that you get some email flowing with the author - I was exploring the use of REXML for some scripts, and was having a struggle with some of the concepts because I was trying to use the interface in a DOM-ish way.  The author unnervingly figured out what kind of programmer I was (I have the strongest background in HTML & JavaScript) and coached me through the more Ruby-like syntax which he claimed was easier to read, and faster to boot.  I'm still no expert, but I'm beginning to see the light.

Mind you, I'd be pretty impressed if there was a faster way to do test 1 than the way you're doing it, but for some of the other operations, you may find that the XPath interface in REXML is slower than a REXML-native one.

Posted by Robert Hahn at

Let's suppose you want some real speed and functionality.  Mind you, this requires some courage and futuristic glasses.

Send me your Atom feeds and I will convert them to YAML.  Ruby has builtin YAML support which is supra-swift.  I've no idea what the size of these feeds are that you have nor the speed of your system.  But I'm regularly loading the RAA YAML feed (750k) in a fraction of a second.  Sure, YPath support has not matured much.  And I guess it's kind of a leap to make such a suggestion in a public place.  But you're the one who compared Ruby and Python in the first place, right?  So let's try Ruby with YAML.  These two are blood brothers.

Hopefully this isn't an offensive request.  Atom is very cool and the world revolves around so saying.  But this is Seabiscuit talking.  Post your files.

Posted by why the lucky stiff at

Cool!  A friendly challenge.  Just so that we are clear on what the end goal is: I would like to support these queries on this data.  Proposals welcome.

P.S., I'm getting a parse error on that document with Ruby 1.8 on Linux:

require 'net/http'
require 'yaml'
raa=Net::HTTP.new('raa.ruby-lang.org').get('/raa-yaml.yml').body
YAML.load_documents(raa) {}
Posted by Sam Ruby at

I would like to note that the latest stable release of PyYaml (SHowell branch) still has at least one data-corrupting bug which has bitten me personally: a string value containing HTML markup and carriage returns, when deserialized and reserialized, ends up being written out in a non-compliant form that chokes the parser and prevents the entire file from being loaded ever again.

The maintenance of PyYaml was taken over this spring, but there have been no releases yet under the new maintainer.

I've used YAML for several small private projects, but I won't be using it for any future ones, and I'm in the process of migrating away from it in existing projects.

Posted by Mark at

The gauntlet is thrown, then.  If you see me running, try and keep up.  I'll report back here when it's all cooked up.

Pardon the RAA feed.  Needs a bit of work. Here's a patched version of the RAA feed that will work for you. 

PyYaml is in a coma for the time being.  But work has progressed much on the Syck extension for Python.  The parser is quite solid, well tested by my pedantic associates in Japan.

Posted by why the lucky stiff at

I just coded up test6: streaming REXML.  Elapsed time: 65.116s.  I reran test1 and test2 to ensure that there weren't any environment changes which could have affected the results.  The results of these tests were consistent with the previous results.

Weird.

Posted by Sam Ruby at

Text processing shootout: Ruby vs. Python, YAML vs. XML

It all started on Sam Ruby's weblog, with benchmarking between Ruby (the language) and Python. Python came out as a clear winner, although I'd say the comparison was a bit unfair (comparing REXML, a pure Ruby implementation, with Python bindings...... [more]

Trackback from Notes from my terminal at

What's so weird? Aren't you still comparing a native Ruby library (REXML) to a Python wrapper around a C library (libxml2)? Of course REXML is going to be much slower.

I see that in one of your comments you actually did test the Ruby wrapper to libxml, and it seems to have outperformed the Python equivalent. Or am I missing something here?

This kind of treatment just seems like nothing more than fanboyism to me. Disappointing. Shall I make a "comparison" between two factorial programs, one in Ruby that uses an extension wrapping -lgmp, and the other a pure Python program?

No, that'd be stupid.

Posted by Jim at

Jim, what I find weird is that the streaming interface to REXML appears to be quite noticeably slower (by a factor of six) on this instance data than the document method.  Using the same library and language.

My goal in this exercise is to benchmark XPath enabled libraries (hence the title to this blog entry).
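For readers unfamiliar with the two REXML styles being compared, here is a rough sketch of each. It is not the test1 or test6 code, which is not shown in this thread: the tiny feed, the TitleListener class, and the helper names are all hypothetical. It uses REXML::Document.new with REXML::XPath.match for the tree style, and REXML::Document.parse_stream with a REXML::StreamListener for the streaming style.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical miniature feed; the real test data was a set of Atom entries.
FEED = <<~XML
  <feed>
    <entry><title>one</title></entry>
    <entry><title>two</title></entry>
  </feed>
XML

# Tree style: build the whole document, then query it with XPath.
def titles_via_tree(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//entry/title').map(&:text)
end

# Stream style: collect title text as parse events arrive; no tree is built.
class TitleListener
  include REXML::StreamListener
  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def tag_start(name, _attrs)
    return unless name == 'title'
    @in_title = true
    @titles << String.new
  end

  def text(chars)
    @titles.last << chars if @in_title
  end

  def tag_end(name)
    @in_title = false if name == 'title'
  end
end

def titles_via_stream(xml)
  listener = TitleListener.new
  REXML::Document.parse_stream(xml, listener)
  listener.titles
end

# Run a block and return [result, elapsed_seconds].
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

tree, tree_secs = timed { titles_via_tree(FEED) }
stream, stream_secs = timed { titles_via_stream(FEED) }
puts format('tree:   %p in %.6fs', tree, tree_secs)
puts format('stream: %p in %.6fs', stream, stream_secs)
```

On a document this small the timings say nothing useful; the point is only to show the shape of the two interfaces, so that the same harness can be pointed at real data.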

Posted by Sam Ruby at

Benchmark update

In the previous benchmark I compared REXML to libxml2.  In order to get a deeper understanding of where the time was being spent in REXML, I also compared REXML Document.new against Marshal.load of the same data.  The result was that 2/3 of the time wa... [more]

Trackback from Sam Ruby at

Hi

The results of your benchmark are:

test1 REXML Document.new Ruby 1.8 31.961s
test2 REXML Marshal::load Ruby 1.8 10.325s
test3 libxml2 parseFile Python 2.2 1.876s

With all due respect:
As others have noted you're comparing apples to oranges.

In order to compare Python/libxml2 (interface to a compiled lib) to
Ruby, I suggest to compare it to a Ruby interface to a compiled lib,
and not compare it to a native Ruby lib.
Ruby/libxml2 seems to be a good choice for this task.

From
http://whytheluckystiff.net/arch/2003/10/08/1065655587 :

ruby + libxml  [...] 2.695s
python + libxml [...] 2.776s

Unsurprisingly the two libs exhibit very similar performance,
Ruby/libxml2 even is a bit faster.

If you want to compare the speed of REXML (which is written in Ruby)
to a Python lib, I suggest to choose a native Python lib written in
nothing but Python.

Tobi

Posted by Tobi at

Tobi - you misunderstand what I was attempting to do.  I was not attempting to benchmark languages (the tests themselves do little more than calls the parser), but to benchmark xml libraries (hence the title).

I even created a version of test3 in Ruby and posted the results.

Posted by Sam Ruby at

Chalk Line

Today we've got the equivalent of Monster Garage going on in the text processing world. My favorite charity organization just received a bit of bad press on Intertwingly. Not muckraking stuff. Just some benchmarks, but sometimes you...

Excerpt from whytheluckystiff.net at

I'll tell ya what.  The speed difference between libxml2 on Python and libxml2 on Ruby appears to be entirely due to initial load time of the lib.  Weird, eh?  (Trombone cadence.)

Yeah:

require 'xml/libxml' takes ~0.05 secs.
import libxml2 takes ~0.30 secs.

Hint to the snakies.
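Load time on the Ruby side can be checked with something like the following. It is a sketch only: time_require is a made-up helper, and 'json' is used as a stand-in because the libxml bindings may not be installed.

```ruby
# Measure how long a `require` takes the first time a library is loaded.
# Returns elapsed seconds. A library that is already loaded returns almost
# immediately, so run this in a fresh interpreter for a meaningful number.
def time_require(lib)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  require lib
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# 'json' is a stand-in (assumption); substitute 'xml/libxml' if the
# libxml bindings are installed.
puts format("require 'json' took %.4fs", time_require('json'))
```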

Posted by why the lucky stiff at

Sam

"I was not attempting to benchmark languages (the tests themselves do little more than calls the parser), but to benchmark xml libraries (hence the title)."

I know. It would have been great if the Ruby/libxml2 results were published in the same table as the Python/libxml2 results, since some people might not realize that you are comparing a C lib to a Ruby lib (which doesn't make much sense IMHO since you're exclusively comparing speed).

The other suggestions remain as well: (sorry for quoting myself)

If you want to compare the speed of REXML (which is written in Ruby)
to a Python lib, I suggest to choose a native Python lib written in
nothing but Python.

Tobi

Posted by anonymous at

OK, I've updated the table for posterity.  All the results are now with Ruby 1.8.  The prior result for libxml2 with Python 2.2 was 1.876s.

Posted by Sam Ruby at

Ruby and Code Generation

I've been playing with Ruby the latest couple of weeks (and I'm not the only one ;). We are starting a new project and I'm trying to evangelize the rest of the team to use code generation throughout the build process. While reading Code Generation...

Excerpt from Andres Aguiar's Weblog at
