It’s just data

Venus Rising

Today I’m making available Planet Venus, which can only be described as a radical refactoring of the Planet code base.

The reasons for a radical refactoring are several.  My primary reason is that I find that I don’t enjoy working on codebases that don’t have an automated regression test suite.  Furthermore, since codebases with such a test suite tend, in my experience, to be more modular, it generally is difficult to bolt a test suite on afterwards.  On this point, I would be glad to be proven wrong.

A second reason is that a number of people have identified memory consumption as a performance issue with the existing Planet.  The current design reads into memory all the content and metadata associated with every post for every feed you are subscribed to, updates it, writes it back out, and then makes multiple passes through this data.  While CPU utilization issues can be mitigated with tools like nice, memory issues are a bit harder to address.
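
By way of contrast, here is a minimal sketch of the incremental alternative: fetch one feed at a time and write each entry to its own small cache file, so that only one feed’s worth of data is ever in memory.  The names and details below are illustrative only, not the actual spider.py internals:

import os
import feedparser

def spider_feed(feed_uri, cache_dir):
    # Only one feed's worth of entries is held in memory at once.
    data = feedparser.parse(feed_uri)
    for entry in data.entries:
        # One small file per entry; later passes can stream these from
        # disk instead of holding every subscribed feed in memory.
        # (The id is naively mapped to a file name here; see the
        # comments below about which characters are actually safe.)
        name = entry.get('id', entry.get('link', 'unknown'))
        out = os.path.join(cache_dir, name.replace('/', ',').replace(':', ','))
        f = open(out, 'w')
        try:
            f.write(entry.get('summary', u'').encode('utf-8'))
        finally:
            f.close()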

A final reason is that there has been an as-yet unmet demand for customization.  Conceptually, all of the use cases for GreaseMonkey apply equally to feeds, and in particular the canonical one of wanting to use the Coral content network selectively applies here too.  This is difficult for feeds, not only because of the various feed formats out there and the prevalence of invalid feeds, but also because some elements may contain plain text, escaped HTML, or embedded XHTML.  Having all markup pre-sanitized and converted to well-formed XHTML, with all relative URI references pre-resolved, makes the job of producing a plugin script much easier.
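
To make the GreaseMonkey analogy concrete, here is a minimal sketch of what a Coral-style plugin could look like once it can rely on well-formed XHTML input.  The stdin-to-stdout convention and the element handling are assumptions for illustration, not a description of the final plugin interface:

import sys
from xml.dom import minidom

# The entry arrives pre-sanitized, well-formed, and with relative URIs
# already resolved, so a straight DOM parse is all that is needed.
doc = minidom.parse(sys.stdin)

for img in doc.getElementsByTagName('img'):
    src = img.getAttribute('src')
    if src.startswith('http://'):
        # Rewrite the host so the image is fetched through the Coral
        # content network (host.nyud.net:8080).
        parts = src[len('http://'):].split('/', 1)
        host = parts[0] + '.nyud.net:8080'
        path = ''
        if len(parts) > 1:
            path = parts[1]
        img.setAttribute('src', 'http://%s/%s' % (host, path))

sys.stdout.write(doc.toxml('utf-8'))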

This is a work in progress, and not really even ready for experimental use just yet.  I’ve been working on it slowly over a period of time; this week I happened to have extended periods without network access, and this was something I could play with offline.

If you have an existing planet and want to try this out, take your config.ini, change your cache_directory to point to an empty directory, and run the following commands:

python spider.py config.ini
python splice.py config.ini > examples/index.html
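
If you don’t have an existing planet to start from, a minimal config.ini along these lines should be enough to exercise the two commands above, assuming the section and option names follow the existing Planet conventions; the feed URL and paths are placeholders:

[Planet]
name = An Example Planet
link = http://planet.example.com/
cache_directory = cache

[http://example.com/feed.atom]
name = An Example Feed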

While I don’t yet have template support (patches welcome), I do have a sample XSLT file that will produce something recognizable.


[from wearehugh] Sam Ruby: Venus Rising

planet++...

Excerpt from del.icio.us/network/blech at

Cool stuff...

Looks like there might be a glitch in the tarify.cgi config though, as the generated tarball and zips are empty.

Posted by Ryan Cox at

Sam Ruby: Venus Rising

wearehugh : Sam Ruby: Venus Rising - planet++ Tags : feedparser feeds planet...

Excerpt from HotLinks - Level 1 at

Looks like there might be a glitch in the tarify.cgi

Oops!  Fixed.  Thanks!

Posted by anonymous at

I noticed your normalization routine is very small.  Is that really sufficient to clean feeds?  If so, that’s surprisingly... um, small.

Posted by Ian Bicking at

I noticed your normalization routine is very small.  Is that really sufficient to clean feeds?  If so, that’s surprisingly... um, small.

UFP and BeautifulSoup do the heavy lifting; I’m just handling what’s left over.

Posted by Sam Ruby at
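
In other words, the pipeline amounts to something like the following minimal sketch; the function name and the exact hand-off are illustrative, the point being that feedparser’s sanitizer and BeautifulSoup’s tag fixing leave very little for the normalization routine itself to do:

import feedparser
from BeautifulSoup import BeautifulSoup

def normalize(content):
    # feedparser's sanitizer has already removed scripts, event handlers,
    # and other unsafe markup; BeautifulSoup closes any dangling tags and
    # re-serializes the soup as something well formed (UTF-8 bytes here).
    return str(BeautifulSoup(content))

data = feedparser.parse('http://example.com/feed.atom')
for entry in data.entries:
    cleaned = normalize(entry.get('summary', ''))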

Hi Sam,

2 bugs...

#1: add this feed in config.ini ([link]) and there is an infinite recursion as follows:

  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/ext/__init__.py", line 231, in GetAllNs
    parent_nss = GetAllNs(node.parentNode)
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/ext/__init__.py", line 231, in GetAllNs
    parent_nss = GetAllNs(node.parentNode)
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/ext/__init__.py", line 221, in GetAllNs
    for attr in node.attributes.values():
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/minidom.py", line 827, in _get_attributes
    return NamedNodeMap(self._attrs, self._attrsNS, self)
RuntimeError: maximum recursion depth exceeded

#2: On Windows, the files are not getting created because the id(?) has invalid characters for a WinXP platform.

DEBUG:planet.runner:Socket timeout set to 20 seconds
INFO:planet.runner:Updating feed [link]
Traceback (most recent call last):
  File "spider.py", line 12, in ?
    spider.spiderPlanet(sys.argv[1])
  File "planet\spider.py", line 86, in spiderPlanet
    spiderFeed(feed)
  File "planet\spider.py", line 68, in spiderFeed
    file = open(out,'w')
IOError: [Errno 2] No such file or directory: 'c:\\junk\\planet\\cache\\tag:blogger.com,1999:blog-20638905.post-115500430463204032'

thanks,
dims

Posted by Davanum Srinivas at

I noticed your normalization routine is very small.  Is that really sufficient to clean feeds?

Sam has recently committed several patches to UFP and BeautifulSoup to allow them to take whatever random crap they find in feeds and turn it into well-formed XHTML.  It’s quite impressive, actually, and this project is kind of the culmination of that effort.

Posted by Mark at

add this feed in config.ini ([link]) and there is an infinite recursion

My guess is that the Python runtime library is jumping to a conclusion.  Follow that link and scroll down.  Look closely at the bottom of the entry entitled A Trip to 19th Century - America.  I can only imagine what that looks like after first the UFP and then BeautifulSoup have processed it.

On Windows, the files are not getting created because the id(?) has invalid characters for a WinXP platform.

OK, I’ve verified that colon characters are a problem on win32.  Now the question is whether the mapping should be platform specific, or whether the ability to migrate caches to another architecture is an important feature.

Posted by Sam Ruby at

Venus? Nay surely, you mean Mars.

Posted by Bill de hOra at

Nay surely, you mean Mars.

Mars will follow Earth, and will be in Ruby.

Posted by Sam Ruby at

“Mars will follow Earth, and will be in Ruby.”

Oh man! There is no planet sun or star could hold you, if you but knew what you are.

Posted by Bill de hOra at

Furthermore, since codebases with such a test suite tend, in my experience, to be more modular, it generally is difficult to bolt a test suite on afterwards.  On this point, I would be glad to be proven wrong.

I agree completely. There is a great book that makes the work a little easier though: Michael C. Feathers’ Working Effectively with Legacy Code. He even defines legacy code as “code without tests”. :)

Posted by Michal Wallace at

Sam,

For the “colon characters are a problem on win32”, may I suggest a simple urlencode()/urldecode(), which I believe will work across architectures?

thanks,
dims

Posted by Davanum Srinivas at
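
A minimal sketch of the mapping dims is suggesting above: urllib.quote with an empty safe list escapes the colons (and everything else Win32 objects to), while urllib.unquote recovers the original id, so a cache written this way could still be migrated between architectures.  The function name is illustrative:

import os
import urllib

def cache_filename(cache_dir, entry_id):
    # Percent-encode every reserved character, including ':' and '/',
    # so the result is a legal file name on both Unix and Win32;
    # urllib.unquote() recovers the original id when migrating a cache.
    return os.path.join(cache_dir, urllib.quote(entry_id, safe=''))

# The id from the traceback above becomes
#   cache/tag%3Ablogger.com%2C1999%3Ablog-20638905.post-115500430463204032
print cache_filename('cache',
    'tag:blogger.com,1999:blog-20638905.post-115500430463204032')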

Taking a look at what the original purpose of the filename function was, and at my existing cache, I decided to go for shorter, more cruft-free names; in the process I made the names compatible with the Win32 file system.

Those who ignored my advice and deployed this code, and who choose to update, would be well advised to flush their cache.

I also made a change to treat pretty-printer errors (including the near-infinite recursion) as non-fatal.

Posted by Sam Ruby at

Sam,

The recursion is gone now. Will try win32 later and let you know if there’s a problem.

thanks,
dims

Posted by Davanum Srinivas at

[from ade] Sam Ruby: Venus Rising

“Planet Venus, which can only be described as a radical refactoring of the Planet code base” Now all we need are regular releases as opposed to the snapshot-of-the-month club and I’ll be a happy bunny...

Excerpt from del.icio.us/network/nephariuz at

Btw, anyone interested in this sort of thing might also want to check out Plagger.

Posted by Aristotle Pagaltzis at

links for 2006-08-17

Bare Naked App » Blog Archive » Displaying percentages (tags: ajax css webdesign ui) autotut: Using GNU auto{conf,make,header} (tags: development autoconf automake howto) PycURL Home Page (tags: python curl http lib) Brad Choate: ack (tags: cli...

Excerpt from Breyten's Dev Blog at

links for 2006-08-17

From the blogroll… Why use anything else? Venus Rising Google Talk Adds Voice Mail, File Sharing...

Excerpt from The Robinson House at

For the ghosts of the Lazy Web

I was thinking of an idea to leverage FOAF, but it’s probably nothing new. I want to leverage foaf:OnlineAccount more and have a service that using FOAF generates an OPML file to all of the user’s content distributed on the Web for...

Excerpt from Elias Torres at

I’m the lazy web

My last post was on making use of foaf:OnlineAccount information found in FOAF files to create a complete OPML or better yet feed (or personal planet) of all the information your friends are dumping all over the web. As already stated, I...

Excerpt from Elias Torres at

links for 2006-08-17

the dreaming tree » Blog Archive » bodies (tags: gfmorris_comment) Through a Glass, Darkly » Back to school, back to school, to prove to dad that I’m not a fool. The sound I just heard was Kari vowing to never speak to me again. Or something....

Excerpt from Geof F. Morris's Indiana Jones School of Management at

Bloglines working to make things better

Bloglines is still having difficulties with subscriptions and posts: yesterday evening it happened again, and I lost unread entries from random subscribed weblogs. At least, this time their problem report seems to be a little bit more concerned than...

Excerpt from The Long Dark Tea-time of the Blog at

Mars will follow Earth, and will be in Ruby.

Do you mean this Mars?

Posted by Giulio Piancastelli at

Do you mean this Mars?

Looks like a name-squatter.

Posted by Sam Ruby at

I get the following:

$ sudo python /var/www/venus/splice.py /var/www/planet/planets/quebecois.eu/config2.ini > /var/www/unix.tv/index.html

ERROR:planet.runner:Unable to locate template /var/www/planet/planets/quebecois.eu/index.html.tmpl
ERROR:planet.runner:Unable to locate template /var/www/planet/planets/quebecois.eu/atom.xml.tmpl
ERROR:planet.runner:Unable to locate template /var/www/planet/planets/quebecois.eu/rss20.xml.tmpl
ERROR:planet.runner:Unable to locate template /var/www/planet/planets/quebecois.eu/rss10.xml.tmpl
ERROR:planet.runner:Unable to locate template /var/www/planet/planets/quebecois.eu/opml.xml.tmpl
ERROR:planet.runner:Unable to locate template /var/www/planet/planets/quebecois.eu/foafroll.xml.tmpl

user@server:/var/www/planet/planets/quebecois.eu$ ll
total 72K
4,0K -rw-r--r-- 1 user user 2,2K 2006-07-27 01:53 atom.xml.tmpl
4,0K -rw-r--r-- 1 user user 2,9K 2006-09-10 13:59 atom.xml.tmplc
8,0K -rw-r--r-- 1 user user 5,7K 2006-10-01 17:23 config2.ini
8,0K -rw-r--r-- 1 user user 5,7K 2006-09-30 01:51 config.ini
4,0K -rw-r--r-- 1 user user  921 2006-07-27 01:53 foafroll.xml.tmpl
4,0K -rw-r--r-- 1 user user 1,2K 2006-09-10 13:59 foafroll.xml.tmplc
8,0K -rw-r--r-- 1 user user 5,0K 2006-09-29 03:05 index.html.tmpl
8,0K -rw-r--r-- 1 user user 5,4K 2006-09-29 03:05 index.html.tmplc
4,0K -rw-r--r-- 1 user user  626 2006-07-27 01:53 opml.xml.tmpl
4,0K -rw-r--r-- 1 user user  971 2006-09-10 13:59 opml.xml.tmplc
4,0K -rw-r--r-- 1 user user 1,2K 2006-07-27 01:53 rss10.xml.tmpl
4,0K -rw-r--r-- 1 user user 1,5K 2006-09-10 13:59 rss10.xml.tmplc
4,0K -rw-r--r-- 1 user user  838 2006-07-27 01:53 rss20.xml.tmpl
4,0K -rw-r--r-- 1 user user 1,3K 2006-09-10 13:59 rss20.xml.tmplc

Posted by Gabriel Labelle at

Gabriel: two things.

First, can you try running “python runtests.py” to verify that all is well?

Then can you set the following in your config.ini and try again?

log_level = INFO

You might need to set template_directories in your config.ini file.

P.S.  There now is a planet.py main program.

Posted by Sam Ruby at
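
If the templates really are in the directory shown in the listing above, the setting would presumably look something like this (path taken from Gabriel’s output, so adjust to taste):

template_directories = /var/www/planet/planets/quebecois.eu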

Thanks Sam!

Another thing ... I would like to activate the coral_cdn_filter.py filter, but I simply don’t know what to do.

I simply want to be able to view the blogger images on my planet.

Posted by Gabriel at

What you need to do is to add:

filters = coral_cdn_filter.py

to either the [planet] section, or to each of the feeds on which you want this filter to be run.

Note: filters are run before the data is written to the cache, so you will either need to delete the cache or wait until new entries appear.

Posted by Sam Ruby at
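
In config.ini terms, the two placements Sam describes look roughly like this; the feed URL is a placeholder and the rest follows the usual Planet section layout:

[Planet]
filters = coral_cdn_filter.py

or, for a single feed only:

[http://example.com/feed.atom]
name = An Example Feed
filters = coral_cdn_filter.py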

OK, thanks, I now see the changes to the index.html, but the images still don’t show up for the blogger feeds ... Maybe I’m missing something?

When I test a URL, for example [link], I still get a 403 Forbidden error.

My test planet running venus is at [link]

Posted by Gabriel at

Planet Venus

Today I changed the software that powers “Българска свободна планета” and “Планета GNOME”.  Actually, “changed” is an overstatement, because the Venus project is... [more]

Trackback from Arcane Lore at

Ясен Праматаров: Planet Venus

Today I changed the software that powers “Българска свободна планета” and “Планета GNOME”.  Actually, “changed” is an overstatement, because the Venus project is built on top of the well-known Planet.  Sam Ruby is clearly a very energetic enthusiast - after...

Excerpt from Българска свободна планета at

Bloglines vs. Google Reader

I am getting somewhat disappointed with Bloglines because of the mess it has been making lately in identifying new entries in a blog, especially when I ask the program to keep entries marked as new. I have been experimenting with Google...

Excerpt from Superfície Reflexiva at

Things that inspire envy of computer languages I don’t use

Ruby, Python, Perl, PHP, JSP, etc. For web development, the languages themselves have certain strengths, but they also, eventually, acquire software projects that make them famous. Ruby, of course, has Rails, and through that, all the software of 37...

Excerpt from Closer To The Ideal at

Why I keep using my own pulse

I’ve been a fan of personal feed aggregation services for a long time. I’ve been trying: Suprglu iStalkr Feedfriend and Plaxo Pulse I’ve even built my own pulse once and twice . Now Plaxo announces something new: The Plaxo Pulse Widget allows you to...

Excerpt from Lars Trieloff's Collaboration Weblog at

Why I keep using my own pulse

I’ve been a fan of personal feed aggregation services for a long time. I’ve been trying: Suprglu iStalkr Feedfriend and Plaxo Pulse I’ve even built my own pulse once and twice . Now Plaxo announces something new: The Plaxo Pulse Widget allows you to...

Excerpt from Lars Trieloff: Recent Changes on SuprGlu at

RSS Feed Aggregators

For the Atlanta Java User Group I have been looking into providing an RSS feed aggregation service for blogs of AJUG members. Since AJUG is running its own server, I wanted to provide a server-based solution. Interestingly, it seems the choices are...

Excerpt from Gunnar Hillert's Blog at
