intertwingly

It’s just data

Spider Threads


Joe Gregorio: This branch includes httplib2 to handle the fetching. I have added a new config option ‘spider_threads’ that you can set to the number of threads you want to use when spidering. The default is 0. When spider_threads is set to zero, httplib2 is not used and feedparser is used to fetch the feeds. Note that the threading only applies to HTTP(S) URIs; all other URI types are done in the main thread and handled by feedparser. All parsing is also handled only in the main thread.
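
To make the division of labor Joe describes a bit more concrete, here is a minimal sketch of the pattern; it is not the code from either branch. The names `spider`, `fetch`, and `parse` are hypothetical stand-ins of my own: in the real thing the threaded fetch would go through httplib2 and the parse through feedparser. As a simplification, the sketch uses one `fetch` callable everywhere, whereas the branch hands non-threaded fetches to feedparser.

```python
import queue
import threading
from urllib.parse import urlparse


def spider(uris, fetch, parse, spider_threads=0):
    """Fetch HTTP(S) URIs on worker threads; parse only in the calling thread.

    fetch(uri) -> raw data and parse(uri, raw) -> result are hypothetical
    callables supplied by the caller, not functions from the actual branch.
    """
    results = {}
    work, fetched = queue.Queue(), queue.Queue()

    # Only HTTP(S) URIs go to the workers, and only if threading is enabled.
    pending = 0
    inline = []
    for uri in uris:
        if spider_threads > 0 and urlparse(uri).scheme in ("http", "https"):
            work.put(uri)
            pending += 1
        else:
            inline.append(uri)

    def worker():
        # Drain the work queue, handing each fetched body back to the main thread.
        while True:
            try:
                uri = work.get_nowait()
            except queue.Empty:
                return
            try:
                fetched.put((uri, fetch(uri), None))
            except Exception as exc:  # report the failure instead of losing it
                fetched.put((uri, None, exc))

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(spider_threads)]
    for t in threads:
        t.start()

    # All other URI types are fetched and parsed in the main thread.
    for uri in inline:
        results[uri] = parse(uri, fetch(uri))

    # Threaded fetches are parsed here too, so parsing never leaves the main thread.
    for _ in range(pending):
        uri, raw, exc = fetched.get()
        results[uri] = exc if exc is not None else parse(uri, raw)

    for t in threads:
        t.join()
    return results
```

With `spider_threads=0` no workers are started and everything runs serially in the caller; with a small value such as 5, the slow network waits overlap while parsing stays single-threaded.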

I’ve merged this work into my branch. While there is more work to be done (e.g., better reporting of status codes, IRI support), a rather dramatic speedup is possible with this option, even with a relatively low setting such as 5.

You can see this in action by viewing my log file.