It’s just data

Planet Pruning

Phil Wilson: Since I am too lazy to manage my own subscriptions, I was subscribed to Planet Intertwingly. At 269 feeds though, the signal/noise ratio has taken a bad hit (what do you mean Sam doesn’t tailor his blogroll for me personally?) and I’m going to have to actually import the OPML and weed out stuff I’m not interested in. How annoying.

The issue isn’t the number of feeds, but the number of entries.  And some of the people I subscribe to talk a lot, so it is time to prune.  To help with this task, I wrote a little script.

So, fair warning.  If you are subscribed to Planet Intertwingly and would like to keep seeing entries from some of the big names you see on this page, you might want to subscribe to those feeds separately.  Unless, of course, you are using Google Reader, in which case perhaps you shouldn’t be.


i don’t find the volume of planet intertwingly bad at all, but it is annoying when entries from chris anderson and guy kawasaki keep jumping to the top. especially when it is a tremendously huge post like guy’s post about techshop.

Posted by jim at

That’s easily fixable.

ignore_in_feed = updated

Done on both feeds.

Posted by Sam Ruby at

/me glances at list of “big names”

Yeah, that Sam Ruby fellow posts way too much.

Posted by Rod Begbie at

Unless, of course, you are using Google Reader, in which case perhaps you shouldn’t be.

Are you suggesting that Google Reader should be ignoring posts with duplicate ids even when those posts are from different feeds? If they were to do that, what’s to stop me posting a bunch of messages with ids of the form tag:intertwingly.net,2004:xxxx thereby blocking any future messages from you?

Posted by James Holderness at

[from sogrady] Sam Ruby: Planet Pruning

wherein i get pruned from Planet Intertwingly - and i thought i wasn’t posting enough ;)...

Excerpt from del.icio.us/network/jasonscheirer at

ignoring

Ignoring?  No.  But provide at least an option to recognize entries that have already been read?  Yes.  Like some other feed readers do.

Posted by Sam Ruby at

Sam, thanks for highlighting that issue here. Maybe it might receive some attention now!

James, nice comment. RFC4287 does mention this possibility. XML digital signatures could also be used, although off-hand I can’t think of any readers that currently use this. My current thinking is that tool support should be offering us some more options for handling this - like Sam mentions, some already do. I personally need Google Reader to do the same, since I don’t relish the prospect of migrating given the meta-data that I’ve currently associated with my feeds.

Posted by James Abley at

We could always all jump on the google reader forum and vote for James' bug report. Although in the past google engineers have left comments here, so I expect they’ll see Sam’s criticism anyway.

Posted by Brian Ewins at

But provide at least an option to recognize entries that have already been read?  Yes.  Like some other feed readers do.

If a feed reader can be tricked into marking something as read when you haven’t read it, that can be just as bad as having the item ignored altogether. Some people configure their feed readers to only show unread items (I believe this is the default interface in Bloglines for example - or at least used to be). Under those conditions, a message that is marked as read might as well not exist.

Also, I don’t know (maybe you do) whether NNW is storing one copy of the message visible from both feeds, or two completely separate copies that are merely linked for the purpose of checking their read status (a more expensive operation IMO). If it’s the former, I could use my bogus feed to control the contents of all your posts - that’s arguably worse than not seeing your posts at all.

Maybe I’m missing something, but I just don’t see atom:id as the magical silver bullet that Tim was making it out to be.

XML digital signatures could also be used

I don’t see how. What did you have in mind?

Posted by James Holderness at

This gets interesting. Rather than pruning, you can start to sample from entries, maybe based on whatever cluster metric goes into the memes. I guess if you take that to the extreme, you get Google News.

Posted by Hugh Winkler at

XML digital signatures could also be used

They could help but it could get rather complicated rather quickly.  DSigs in feeds have their merits and will be essential in some cases, but for this use case, it’s likely overkill.  There may be room for a more lightweight solution that is a bit more reliable than atom:id’s.

Posted by James Snell at

links for 2007-09-14

Erwann Chénedé’s Weblog interesting. compiz is already up and running on Solaris. (tags: compiz solaris 3d desktop gnome graphics via:glynn) More translating corporate speak | LinuxWorld Community “English: ‘Remember Unix source licenses? We’re...

Excerpt from tecosystems at

The script would be welcome in venus examples, or even in a new contrib directory. I couldn’t see the script in Venus' metaplanet, which, BTW, could enjoy a new subscriber, as I managed to get my repo exposed at the server and stable enough, I hope. I have it using a Trac instance, which gives neat diffs, and a couple of minor changes, too.

Next step (in my master plan) is exposing my mombo’s changes and getting it going.

Posted by Santiago Gala at

I agree with James Snell that XML Digital Signatures are likely to be too heavyweight for this use case. For situations where you absolutely can’t have this happen, they have a place, but I would hope that social patterns would govern this situation. I would like the spec to have explored this situation in more detail, but it doesn’t. Oh well, shit happens. So IMHO, what is likely to happen is that sites which produced such feeds would either be labelled as malicious, bozos or responsive to bug fixes, depending on how they respond to the community reaction. Consistent offenders would find that their content wouldn’t get re-syndicated or aggregated in other places. If you don’t play well with the other kids, then they won’t want to play with you.

James (Holderness), care to flesh out any other scenarios around this so we can explore how well the spec caters for them and how best to defend against it?

Posted by James Abley at

I would hope that social patterns would govern this situation.

Exactly.  I think you have done a good job outlining the scope of the problem, now let me add some real life details.  I publish Planet Intertwingly.  The subscription list and software that produces that page are also published.  If you work out the details, what James Holderness can do TODAY is fairly minimal, and whatever he does would leave an audit trail, one that would lead back to his site.

All this reminds me of the days when people said that wikis would never work as you needed strong authentication and access control.  In their place, wikis put in versioning and auditability.  In this case, what we have is actually stronger authentication than wikis tend to have: the host from which the entry was fetched is recorded.

Think gloves, people, gloves.  Venturing outside on a cold winter’s day, or addressing James Abley’s very real need, doesn’t have to wait until either an effective solution to recirculating waste heat relying only on bicycle power or a strong identity system is in place.  It can be done today, and in much the same manner that Google deals with the rest of the web.

Posted by Sam Ruby at

Ok, the gloves story was very funny, and maybe I’m a “Complicator”, but I think you’re missing a number of attack vectors (that won’t necessarily leave an audit trail), possibly because they wouldn’t apply to you. But I’ll just shut up for now. If/when somebody other than NNW gets around to implementing this (someone whose code I can actually evaluate), I’ll see how many holes I can poke in their solution.

Posted by James Holderness at

If/when somebody other than NNW gets around to implementing this (someone whose code I can actually evaluate)

A number of sites are based on Planet 2.0 or Venus.  You can evaluate that code.

Posted by Sam Ruby at

A number of sites are based on Planet 2.0 or Venus.  You can evaluate that code.

None of my devious schemes would likely pose any threat to a planet site. I’m thinking more in terms of personal feed readers like NNW and Bloglines. The difference is in the types of feeds that are likely to be subscribed, and how that feed content is likely to be stored and viewed.

Posted by James Holderness at

Re measuring the authenticity of a feed; what about a simple algorithm that checks the URI of atom:source against the base URI of the feed itself? If those match, the entry is most probably downloaded from its original source, and duplicate atom:ids from a different atom:source (or a spoofed atom:source) will be overwritten. Easy enough, isn’t it?
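
A rough sketch of that check in Python, under some loud assumptions: entries are plain dicts with hypothetical `id` and `source` keys, "matching" means same scheme and host, and an entry with no atom:source is treated as native to the feed it was fetched from. It is only an illustration of the idea, not code from any existing reader.

```python
from urllib.parse import urlsplit


def same_site(uri_a, uri_b):
    """Loose match (an assumption): same scheme and host, ignoring case."""
    a, b = urlsplit(uri_a), urlsplit(uri_b)
    return (a.scheme.lower(), a.netloc.lower()) == (b.scheme.lower(), b.netloc.lower())


def is_original(entry, feed_uri):
    """An entry counts as original when its atom:source points at the feed it came from."""
    source = entry.get("source")  # hypothetical key holding the atom:source URI
    return source is None or same_site(source, feed_uri)


def merge(store, entry, feed_uri):
    """Keep one entry per atom:id; a copy fetched from its original feed wins."""
    original = is_original(entry, feed_uri)
    known = store.get(entry["id"])
    if (
        known is None                             # first time this id is seen
        or known["feed"] == feed_uri              # normal update from the same feed
        or (original and not known["original"])   # original replaces a re-syndicated copy
    ):
        store[entry["id"]] = {"entry": entry, "feed": feed_uri, "original": original}
    return store
```

Note that a feed which simply omits atom:source still looks "original" to this check, so on its own it only sorts out honest re-syndication.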

Posted by Asbjørn Ulsberg at

there are a couple of simpler options..

1) an aggregator could just compare contents and/or timestamps of the two items with the same atom:id, and just mark it as read if they are the same

2) first time your aggregator notices two different items with the same atom:id (different contents, from different feeds), it can ask you what to do with them (this time, and in the future):
  a) leave them both, or
  b) let you select the feed you trust more..

i feel that (1) would cover all planet-like use-cases, (2a) could be used if two different legitimate feeds produce the same IDs by mistake, and (2b) would provide ample protection from the bad guys.. ;)
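
a minimal sketch of (1) and (2) in Python, just to make the flow concrete. everything here is an assumption for illustration: items are dicts with hypothetical `feed`, `content` and `updated` keys, "the same" means both content and timestamp match, and `ask_user` is a made-up callback that returns either "both" or the URI of the feed to trust.

```python
def handle_duplicate(known, incoming, ask_user, decisions):
    """Decide what to do with two items that share an atom:id.

    known / incoming: dicts with 'feed', 'content' and 'updated' keys (assumed layout).
    ask_user: hypothetical callback returning 'both' or the URI of the trusted feed.
    decisions: dict that remembers earlier answers for a given pair of feeds.
    """
    # option 1: identical copy seen via another feed, so just mark it as read
    if known["content"] == incoming["content"] and known["updated"] == incoming["updated"]:
        return "mark_read"

    # option 2: the items differ, so ask once and remember the answer
    pair = frozenset((known["feed"], incoming["feed"]))
    if pair not in decisions:
        decisions[pair] = ask_user(known, incoming)
    choice = decisions[pair]

    if choice == "both":
        return "keep_both"  # option 2a
    # option 2b: keep whichever item came from the trusted feed
    return "keep_incoming" if choice == incoming["feed"] else "keep_known"
```

the returned strings are only labels; a real aggregator would act on them however its own storage works.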

Posted by Tomislav Jovanović at

A web without URIs is like a language without functions

Stefan Tilkov quoting Jonathan Allen : With all the buzz about Halo 3, Microsoft couldn’t help but to use it as an excuse to make users download SilverLight. It is not really a showcase though, as it does not do anything an experienced designer...

Excerpt from Boxes and Glue at
