libxml2 screams
Inside the libxml2 Python distribution are a few tests. One of them is named, simply enough, xpath. The purpose of this test apparently is to parse a small file, evaluate an xpath expression against it, clean up, and repeat this in a loop one thousand times. This runs in subsecond time on my machine.
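The loop looks roughly like this. This is a sketch from memory using the standard libxml2 Python bindings; the file name tst.xml and the expression //* are stand-ins for whatever the bundled test actually uses:

```python
import libxml2

for i in range(1000):
    doc = libxml2.parseFile("tst.xml")   # parse a small file
    ctxt = doc.xpathNewContext()         # set up an xpath context
    res = ctxt.xpathEval("//*")          # evaluate an expression against it
    if not res:
        print("no nodes matched")
    ctxt.xpathFreeContext()              # clean up
    doc.freeDoc()

libxml2.cleanupParser()
```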
What this leads me to conclude is that libxml2 is optimized for parsing lots of small files. So I tested the theory by running a more realistic query against all of the weblog entries on my site. The result was still subsecond.
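Something along these lines reproduces that kind of test; the glob pattern and the //title expression here are illustrative stand-ins, not the actual layout or query:

```python
import glob
import libxml2

matches = []
for name in glob.glob("weblog/*/*.xml"):   # every entry on the site
    doc = libxml2.parseFile(name)
    ctxt = doc.xpathNewContext()
    for node in ctxt.xpathEval("//title"): # a representative query
        matches.append(node.content)
    ctxt.xpathFreeContext()
    doc.freeDoc()

print(len(matches), "titles found")
```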
Sweet.
That does not mean that I shouldn't migrate to an XML database, but merely that I don't need to do so today.
What it does mean is that I can spend my time thinking about what I want my url space to look like and designing the schema I choose to expose. There are some obvious things: it makes sense to have all of the structure exposed rather than obscured, and to use a date format that can easily be collated.
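For the date format, ISO 8601 style dates are the obvious candidate, since a plain string sort puts them in chronological order without any parsing:

```python
# ISO 8601 dates collate correctly as strings.
entries = ["2002-12-20", "2002-01-05", "2002-11-30"]
print(sorted(entries))   # ['2002-01-05', '2002-11-30', '2002-12-20']
```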
As far as the url space goes, I want to make sure that the results are readily cacheable. Thinking about the usage pattern, what I am likely to find is:
- not overly complicated queries
- queries which are ad-hoc and therefore hard to optimize for
- the number of unique queries issued per day is likely to be small
- the total number of queries issued (including repetitions) may be highly variable; all it takes is someone like Jon Udell posting a few links to make it spike.
Given this usage pattern, it would seem that my existing cache exactly fits this requirement. Sweet.
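To illustrate why this pattern rewards caching: with only a handful of unique queries repeated many times, everything after the first request for a given query is a cache hit. This is a hypothetical sketch, not my actual cache implementation:

```python
import libxml2

_cache = {}

def evaluate(query, filename="weblog.xml"):
    """Return serialized nodes for an xpath query, computing it only once."""
    key = (filename, query)
    if key not in _cache:                 # only the first request pays
        doc = libxml2.parseFile(filename)
        ctxt = doc.xpathNewContext()
        _cache[key] = [node.serialize() for node in ctxt.xpathEval(query)]
        ctxt.xpathFreeContext()
        doc.freeDoc()
    return _cache[key]
```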
I'll probably play with this for a few days before I deploy it publicly.