libxml2 screams
Inside the libxml2 Python distribution are a few tests. One of them is named, simply enough, xpath. The purpose of this test apparently is to parse a small file, evaluate an xpath expression against it, clean up, and repeat this in a loop one thousand times. This runs in subsecond time on my machine.
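The loop looks roughly like this. This is a sketch from memory using the standard libxml2 Python bindings; the file name tst.xml and the expression //* are stand-ins for whatever the bundled test actually uses:

```python
import libxml2

for i in range(1000):
    doc = libxml2.parseFile("tst.xml")   # parse a small file
    ctxt = doc.xpathNewContext()         # set up an xpath context
    res = ctxt.xpathEval("//*")          # evaluate an expression against it
    if not res:
        print("no nodes matched")
    ctxt.xpathFreeContext()              # clean up
    doc.freeDoc()

libxml2.cleanupParser()
```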
What this leads me to conclude is that libxml2 is optimized for parsing lots of small files. So I tested the theory by running a more realistic query against all of the weblog entries on my site. The result was still subsecond.
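Something along these lines reproduces that kind of test; the glob pattern and the //title expression here are illustrative stand-ins, not the actual layout or query:

```python
import glob
import libxml2

matches = []
for name in glob.glob("weblog/*/*.xml"):   # every entry on the site
    doc = libxml2.parseFile(name)
    ctxt = doc.xpathNewContext()
    for node in ctxt.xpathEval("//title"): # a representative query
        matches.append(node.content)
    ctxt.xpathFreeContext()
    doc.freeDoc()

print(len(matches), "titles found")
```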
Sweet.
That does not mean that I shouldn't migrate to an XML database, but merely that I don't need to do so today.
What it does mean is that I can spend my time thinking about what I want my url space to look like and designing the schema I choose to expose. There are some obvious things: it makes sense to have all of the structure exposed rather than obscured, and to use a date format that can easily be collated.
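For the date format, ISO 8601 style dates are the obvious candidate, since a plain string sort puts them in chronological order without any parsing:

```python
# ISO 8601 dates collate correctly as strings.
entries = ["2002-12-20", "2002-01-05", "2002-11-30"]
print(sorted(entries))   # ['2002-01-05', '2002-11-30', '2002-12-20']
```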
As far as the url space goes, I want to make sure that the results are readily cacheable. Thinking about the usage pattern, what I am likely to find is:
- not overly complicated queries
- queries which are ad-hoc and therefore hard to optimize for
- the number of unique queries issued per day is likely to be small
- the total number of queries issued (including repetitions) may be highly variable; all it takes is someone like Jon Udell posting a few links to make it spike.
Given this usage pattern, it would seem that my existing cache exactly fits this requirement. Sweet.
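To illustrate why this pattern rewards caching: with only a handful of unique queries repeated many times, everything after the first request for a given query is a cache hit. This is a hypothetical sketch, not my actual cache implementation:

```python
import libxml2

_cache = {}

def evaluate(query, filename="weblog.xml"):
    """Return serialized nodes for an xpath query, computing it only once."""
    key = (filename, query)
    if key not in _cache:                 # only the first request pays
        doc = libxml2.parseFile(filename)
        ctxt = doc.xpathNewContext()
        _cache[key] = [node.serialize() for node in ctxt.xpathEval(query)]
        ctxt.xpathFreeContext()
        doc.freeDoc()
    return _cache[key]
```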
I'll probably play with this for a few days before I deploy it publicly.