intertwingly

It’s just data

Venus in the Clouds


Gordon Hodgson: the host has limited scripts to 60 seconds

A discussion about running Venus on a host that limits scripts got me thinking about running Venus elsewhere... like in the clouds.  Given that Venus is written in Python, Google App Engine seemed like a reasonable place to start.

Venus is split into two halves, Spider and Splice; for this proof of concept, I’m focusing on Spider.  Spider fetches from the web, normalizes and filters the data, and writes to the file system.  Google App Engine has some limitations.  The ones that affect this demo are the inability to write to files on disk, the lack of threading, and the inability to access certain external commands within filters, like sed.  The first two can be avoided via monkey-patching, and the last one can be ignored for now.
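To make the monkey-patching idea concrete, here is a minimal sketch in modern Python.  It is not the contents of app.py: the names save_entry, spider, MemoryFile, memory_open, InlineThread, and spider_patched are all hypothetical stand-ins, and a plain dict plays the role that a datastore would presumably play on App Engine.  The point is only that code which calls open() and threading.Thread can be left untouched while both are swapped out underneath it.

```python
import threading
from unittest import mock

_store = {}  # hypothetical in-memory stand-in for the file system


class MemoryFile:
    """Write-only file-like object that saves its text into _store on close."""

    def __init__(self, name):
        self.name, self.parts = name, []

    def write(self, text):
        self.parts.append(text)

    def close(self):
        _store[self.name] = "".join(self.parts)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


def memory_open(name, mode="r"):
    """Replacement for open(); this sketch only handles writing."""
    if "w" not in mode:
        raise IOError("this sketch only supports writing")
    return MemoryFile(name)


class InlineThread:
    """Replacement for threading.Thread that runs the target synchronously."""

    def __init__(self, target, args=()):
        self.target, self.args = target, args

    def start(self):
        self.target(*self.args)

    def join(self):
        pass  # nothing to wait for: the work already ran inside start()


def save_entry(name, text):
    """Stand-in for unmodified code that writes to disk."""
    with open(name, "w") as out:
        out.write(text)


def spider(entries):
    """Stand-in for unmodified code that starts one thread per item."""
    workers = [threading.Thread(target=save_entry, args=item)
               for item in entries.items()]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


def spider_patched(entries):
    """Run spider() with open() and Thread patched, then restore both."""
    with mock.patch("builtins.open", memory_open), \
         mock.patch("threading.Thread", InlineThread):
        spider(entries)
    return dict(_store)
```

Calling spider_patched({"cache/a": "one"}) leaves nothing on disk; the text ends up in _store, and both open() and threading.Thread are restored once the call returns.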

In order to get started, download and install the Google App Engine SDK for Python, then:

cd google_appengine/
bzr get http://intertwingly.net/code/venus/
cd venus/
wget http://intertwingly.net/stories/2009/04/20/app.yaml
wget http://intertwingly.net/stories/2009/04/20/app.py
cd ..
python dev_appserver.py venus

All that remains is to place a config.ini into the venus directory and then to request first spider, and then splice.  Note: if you try this, I’d strongly suggest starting with a smallish config.ini, as the first request will have to wait for a complete fetch and parse of all of your feeds.

This is just the beginning.  For starters, spider needs to be hardened, locked down, and scheduled as a cron job, possibly incrementally.  Splice needs a prettier URI, some templating engines (like XSLT) won’t be available, and the output should be memcached.  And in place of monkey-patching, the cache support should be refactored out into a separate (and pluggable) module.
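As a rough illustration of what a pluggable cache module might look like, here is a sketch in modern Python.  Everything in it is an assumption made for illustration: the get/put interface, the FileCache and MemoryCache classes, and the store_feed helper are hypothetical and do not come from the Venus code base.  The idea is simply that the spider talks to one small interface, and the backend behind it can be the file system on a regular host or something else in the cloud.

```python
import os
import tempfile


class FileCache:
    """Backend that keeps each entry in a file under one directory."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, key):
        return os.path.join(self.directory, key)

    def get(self, key):
        try:
            with open(self._path(key), encoding="utf-8") as source:
                return source.read()
        except FileNotFoundError:
            return None  # a missing entry is reported as None

    def put(self, key, value):
        with open(self._path(key), "w", encoding="utf-8") as out:
            out.write(value)


class MemoryCache:
    """Backend with the same interface that never touches the disk."""

    def __init__(self):
        self.entries = {}

    def get(self, key):
        return self.entries.get(key)

    def put(self, key, value):
        self.entries[key] = value


def store_feed(cache, feed_id, body):
    """Stand-in for spider code: it only sees get() and put()."""
    cache.put(feed_id, body)
    return cache.get(feed_id)
```

With this shape, choosing a backend is a one-line decision at startup, for example store_feed(FileCache(tempfile.mkdtemp()), "feed1", "<feed/>") on a regular host versus store_feed(MemoryCache(), "feed1", "<feed/>") where the disk is off limits.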

But perhaps this is enough to capture the imagination of a potential collaborator?