LRU caching weblogs
My expectations were like Phil's, but the reality is subtly different. When I publish a new post, index.rss is often regenerated before I can ping weblogs.com. Then come index.html, comments.rss, index.rdf, and index.rss2, generally in roughly that order. What's more interesting to me is what happens to the other pages. By running the following script every night, I can keep my cache small:
find . -type f -atime 1 | xargs rm -f
find . -type f -ctime 7 | xargs rm -f
find . -type d -empty | xargs rmdir
This deletes all files that haven't been accessed in over a day and all files that were created over a week ago (ensuring that even minor template changes get propagated), then removes all directories which have been made empty as a result.
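A minimal sketch of a stricter variant of the nightly cleanup, in case a run is ever skipped. It rests on a few assumptions that are not in the original script: the function name `prune_cache` and its directory argument are hypothetical, and `-mindepth` and `-empty` require GNU or BSD find. Because find discards the fractional part of a file's age in days, `-atime 1` matches only files between one and two days old, whereas `-atime +0` matches anything at least one day old. Note also that `-ctime` tests the inode change time, which for an untouched cache file is effectively when it was written.

```shell
# Hypothetical helper, not part of the original script.
# Usage: prune_cache /path/to/cache   (defaults to the current directory)
prune_cache() {
    cache_dir=${1:-.}

    # Remove files last accessed at least one full day ago.
    find "$cache_dir" -type f -atime +0 -exec rm -f {} +

    # Remove files whose inode last changed at least seven days ago,
    # so template changes reach every page within about a week.
    find "$cache_dir" -type f -ctime +6 -exec rm -f {} +

    # Remove directories left empty, deepest first, so a parent that
    # becomes empty is cleaned up in the same pass. -mindepth 1 keeps
    # the cache root itself.
    find "$cache_dir" -mindepth 1 -depth -type d -empty -exec rmdir {} \;
}
```

Using `-exec` instead of piping to xargs avoids trouble with file names that contain spaces, and it does nothing when no files match.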
Not surprisingly, most of the files that remain are HTML files, despite all the various alternative formats I support. Even so, the total cumulative effect of all the various bots running throughout the day is to touch only 20 to 50% of my blog entries. This works out to be approximately 1 to 2 a minute, though the reality is considerably more bursty than that. Only 2 to 3% of all my entries (currently this number is 28 out of 1207) are touched more than once in a day, many of them by actual humans, typically by following inbound links or Google queries.
All of this data could have been obtained by analyzing my Apache logs, but it is much more readily apparent by looking at my cache.
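One way to read the cache at a glance is to tally the surviving files by extension. This is only a sketch: the function name `count_by_extension` is hypothetical, and it assumes file names without embedded newlines.

```shell
# Hypothetical helper: count cached files per extension, most common first.
# Usage: count_by_extension /path/to/cache   (defaults to the current directory)
count_by_extension() {
    find "${1:-.}" -type f -name '*.*' |
        sed 's/.*\.//' |   # keep only the text after the last dot
        sort |
        uniq -c |          # count identical extensions
        sort -rn           # largest count first
}
```

Run right before the nightly cleanup, the html line would show roughly how many entries were touched that day.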