It’s just data

Are My Ears Ringing?

TrackBack and Pingback are push technologies. Mark's Automatic linkbacks and my (unnamed) excerpting functions are pull technologies triggered by referers. The purpose of each is roughly the same: to bring my news to me instead of making me go foraging for it.

I actually experimented with Mark's code for a bit, but the biggest problem I had was that it looked like it would require continual investment to weed out the ever-growing number of portals and personal aggregators. I was also concerned about the feedback loop that could occur given the amount of back traffic I get whenever I mention anything on Mark's page.

By relying on links, I'm also weeding out people who don't have RSS feeds or can't follow instructions.  Unfortunately, I'm also weeding out those who chose to remove the juicy bits (from a data mining perspective) from their weblogs.  For my purposes, feeds like Mark's are best: they contain all the rich content yet provide a prepackaged, simple and clean excerpt for me to deal with.  Or feeds like Burningbird's or Ben Hammersley's who chose to take the initiative and send the excerpts to me.

One thing that has pleased me: several people have added link information that wasn't previously there.


Portals and such are not as large a problem as I imagined, although I do manually maintain a list. Some code could probably be added to check for multiple inbound links on the page, since most portals will list the most recent 5 items (or whatever) from your feed. If a page has all of the most recent 5 links, chances are it's a portal.
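A rough sketch of that portal check, in Python. Nothing like this exists in my script yet; the names `LinkCollector` and `looks_like_portal` are made up for illustration, and the list of recent links is assumed to come from wherever you keep your own feed items.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.add(value)


def looks_like_portal(page_html, recent_links):
    """Return True if the page links to every one of the recent items.

    recent_links is assumed to be the permalinks of your most recent
    5 (or whatever) entries. An empty list never counts as a match.
    """
    collector = LinkCollector()
    collector.feed(page_html)
    return bool(recent_links) and all(
        link in collector.hrefs for link in recent_links
    )
```

This only does exact string matching on the href, so a portal that rewrites or redirects links would slip through.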

The feedback loop from other people doing the same thing is a problem, though. I noticed this when I linked to something Dave Johnson wrote. He now implements a similar system of "further reading" linkbacks, and my linkback script picked up his linkback to my linkback. Or something. Anyway, it was just a bunch of machines talking to each other, which is fine, but I generally don't want to expose that in my UI.

I always said that my "further reading" was like a prisoner's dilemma: great for me as long as nobody else did it. Once everybody's doing it, the feedback loops start and the signal-to-noise ratio plummets and it becomes worthless.

Posted by Mark at

Pingback from Simon Willison: Archive for 15th January 2003

at

oh, bah. the juicy bits were in the <title> and précis line in my feed. however, see http://ken.coar.org/blog/index?entry=71 -- if you want the entire content, add 'words=0' to the GET arguments. (use some other number to get an appropriately-sized excerpt.) this allows you to select just how much content you want in my RSS response to your query.

Posted by Rodent of Unusual Size at

Ken, just so it is clear, I have an automated process which follows this:

<link rel="Alternate" type="application/rss+xml"
title="RSS" href="http://ken.coar.org/blog/index.rss"/>

looking for

<a href="http://www.intertwingly.net/blog/1117.html">

Something that is on your web page, but not in your rss, even if ?words=0 is added.

It's certainly your right not to provide this information.
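For the curious, the two steps above can be sketched roughly like this in Python. This is an illustration, not my actual script: the function names are hypothetical, fetching the page and the feed is left out, and the check is a plain substring match on the feed text.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Finds the first <link rel="alternate" type="application/rss+xml">."""

    def __init__(self):
        super().__init__()
        self.feed_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.feed_url is not None:
            return
        attrs = dict(attrs)
        # Compare case-insensitively: the example above uses rel="Alternate".
        rel = (attrs.get("rel") or "").lower()
        mime = (attrs.get("type") or "").lower()
        if rel == "alternate" and mime == "application/rss+xml":
            self.feed_url = attrs.get("href")


def find_feed_url(page_html):
    """Return the advertised RSS feed URL, or None if there is none."""
    finder = FeedLinkFinder()
    finder.feed(page_html)
    return finder.feed_url


def feed_mentions(feed_text, target_url):
    """Return True if the feed text contains the target URL anywhere."""
    return target_url in feed_text
```

If the feed strips markup, as discussed above, `feed_mentions` comes back false even though the link is on the web page.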

Posted by Sam Ruby at

Does the feedback loop make linkback totally impractical? I mean... even if you implement linkback using RSS feeds gleaned from link tags, somebody could start referencing linkbacks inside RSS feeds (just as some bloggers put reader comments inside RSS feeds) and you'd have another feedback loop.

Posted by Dave Johnson at

What about if you only scan their HTML file one initial time for the linkback at a certain url, and from then on simply increment the visitor count in your database or file or whatever, when you get linkbacks from there? Surely then the flow would go:

They link to you.
People follow link to your site.
Your system scans their site and finds their link, produces an excerpt on your page.
They detect visitors going to their site from yours, scan yours, add a linkback.
People can follow either their linkback or their original link to your site, and it doesn't matter which, all your script does is increment the visitor count by one and use the same excerpt.
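
The steps above might look something like this in Python. It is only a sketch of the idea: `LinkbackStore` and `fetch_excerpt` are hypothetical names, and the store is an in-memory dict standing in for the database or file or whatever.

```python
class LinkbackStore:
    """Scans a referring page once, then only counts later visits."""

    def __init__(self, fetch_excerpt):
        # fetch_excerpt is an assumed callable: url -> excerpt text or None.
        self.fetch_excerpt = fetch_excerpt
        self.entries = {}  # referrer url -> {"excerpt": ..., "hits": ...}

    def record_hit(self, referrer_url):
        entry = self.entries.get(referrer_url)
        if entry is None:
            # First visit from this url: scan it once and keep the excerpt.
            entry = {"excerpt": self.fetch_excerpt(referrer_url), "hits": 0}
            self.entries[referrer_url] = entry
        # Every visit, first or later, just bumps the counter.
        entry["hits"] += 1
        return entry
```

Because the page is never rescanned, a linkback that shows up on it later does not produce a second excerpt.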

Am I missing something here?

Posted by Lach at

Dave - I could put comments in my "regular" feed, but I don't. Furthermore, I don't know anyone who does. This makes it easy to break the loop.

To show the extent of the problem, here's a true story - a fair number of my referers are from 0xdecafbad. It seems that people frequently use my blogroll, and he is at the top. He has a "recent referrers" list, and since I'm often on it, I get hits. If I scan his html, these are valid links to specific blog entries...

Posted by Sam Ruby at

Lach, you are correct, I can optimize this a bit further. By the way, I am not tracking hits by post by referrer.

For what it is worth, when not debugging my script, I validate links only once an hour. If there are multiple hits within the hour, I still only check once. If multiple distinct pages reference the same rss feed, I again will only check that once per hour.
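
A minimal sketch of that once-an-hour rule, keyed on the feed URL so that several pages pointing at the same feed share one check. The class name, the `validate` callable and the injectable clock are all assumptions for illustration, not what my script literally does.

```python
import time


class HourlyValidator:
    """Runs validate(feed_url) at most once per interval for each feed."""

    def __init__(self, validate, interval=3600, clock=time.time):
        self.validate = validate      # assumed callable: feed_url -> result
        self.interval = interval      # seconds between real checks
        self.clock = clock            # injectable so it can be tested
        self.last_checked = {}        # feed_url -> (timestamp, result)

    def check(self, feed_url):
        now = self.clock()
        cached = self.last_checked.get(feed_url)
        if cached is not None and now - cached[0] < self.interval:
            # Checked within the last interval: reuse the earlier result.
            return cached[1]
        result = self.validate(feed_url)
        self.last_checked[feed_url] = (now, result)
        return result
```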

Posted by Sam Ruby at

ah. good point. right-ho, sam, fixed; just add 'sanitise=false' to the arguments (i.e., 'count=0&sanitise=false') and the markup will not be stripped out -- and you can do your mining.

i need to document these controls at some point. there's no <link> syntax for indicating alternate feeds, is there?

btw, i notice that your comment processor leaves < i > -- but encodes < em >. oversight? or is semantic markup considered ungood in comments?

Posted by Rodent of Unusual Size at

feh. type too fast, see what you get. changes, sam: see http://ken.coar.org/blog/index?entry=72 . you want the entire content for mining? use "?words=all&sanitise=false" on the RSS URL. (which can be for the current selection of entries [10 by default, but customisable with "count=n"], a specific entry, all entries for a particular month or day, or all entries within a particular timeframe).

hope this helps..

heh.. one thing i notice: people who rely on automatic interblog communication are sometimes treating those who haven't come up to that level as second-class citizens, or at least less important than those on the bleeding edge. for instance, despite the glory of mark pilgrim's who-has-linked-to-me referral scanback concept, somehow it hasn't managed to locate any of my referrals to him. i wonder why? somehow i doubt that any trackback references to his articles get ignored. or maybe i'm just paranoid. yeah, that's it.

Posted by Rodent of Unusual Size at

p.s.: consider adding the "Comments [n]" info to your RSS feed, so i (et alia) can tell in my aggregator whether there are new bits to peruse.

Posted by Rodent of Unusual Size at

Go ahead. Adjust your set. Mwahahaha! Ahem.

RSS controls and the [in]glory of browser tailoring Updated: Thursday, 16 January 2003 07:01 EST Mark Pilgrim has decided to deal with client differences in CSS handling by having browser-specific stylesheets. That was one of the things I...

Excerpt from Ken's Blog from the Burrow at

Various updates/answers. My script now has in it:

if rss.find('ken.coar.org/blog/index.rss')>0: rss+='?words=all&sanitise=false'
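
A slightly more defensive way to do the same thing, sketched with the standard library. This is not what the script uses; `with_query` is a hypothetical helper that merges parameters into whatever query string the URL already has instead of blindly appending.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url, **params):
    """Return url with params merged into its existing query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update(params)  # new values win over existing ones
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_query(rss, words="all", sanitise="false")` still produces a valid URL if `rss` already carries a `count=` argument.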

I've added <em> to the list of tags I support. Originally, this comment field was intended to be text only, but those dang users...

My rss2 feed has an indication of the number of comments. There also is a separate feed for comments in various flavors of rss. In fact, you can get an rss feed for any single blog entry by simply replacing ".html" with ".rss" or ".rss2" or ".txt" or ".esf" or...
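
In code, the extension swap amounts to something like the following. The helper name is made up, and it assumes the entry URL ends in ".html" as the permalinks here do.

```python
def entry_feed_url(entry_url, flavor="rss"):
    """Turn an entry permalink ending in .html into its feed URL."""
    suffix = ".html"
    if not entry_url.endswith(suffix):
        raise ValueError("expected a URL ending in .html: %r" % entry_url)
    # Replace only the trailing extension, not any ".html" earlier in the URL.
    return entry_url[: -len(suffix)] + "." + flavor
```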

Finally, there seems to be something I am still debugging in my script... it complains about a unicode error, but unfortunately (as near as I can tell) Python is reporting it on the wrong line. Expect continued bursts of activity on your rss feed as I attempt to isolate and squash...

Posted by Sam Ruby at

h'm. then i guess it's my aggregator (amphetadesk) not bothering to inform me about the comments.

btw, if you want to scan all of my blog entries at once, to pick up any old references you might have missed, add 'count=all'.

Posted by RoUS at

Linkback feedback loopback.

Sam Ruby and Mark Pilgrim, who both have weblogs with automatic linkback implementations, both linked to my Introducing Automatic Linkbacks in Roller post the other day. This created a linkback feedback loop. Luckily, I anticipated that some...

Excerpt from Blogging Roller at

Roller TODOs.

Matt has put together a nice Roller 0.9.7 TODO list for himself. Cool stuff. The "remember me" feature sounds especially useful. Apart from finishing-up the linkback feature, the main thing I would like to do is to fix comments. I would like to...

Excerpt from Blogging Roller at

Separating the linkbacks from the feedbacks.

I still owe you a write-up of the Roller linkback implementation. I'll get around to that when I get around to finishing it. I'm way too busy with other things to work on linkbacks right now, but I have been giving it some thought. The linkback...

Excerpt from Blogging Roller at

Pingback from Stephen S Kelley's Web Surfing : Saturday, January 18, 2003

at

TechnoBot

Cool. I already use Technorati to help me find links to excerpt. What interests me is Jabber alerts on comments of any kind. First to me (of course!), and then perhaps to those who register interest in such things. And/or IRC, which... [more]

Trackback from Sam Ruby

at

Thoughts on what has worked, and what has not worked so well, with my automated linkback excerpter function thingamabob TrackBack and Pingback are push technologies. Mark's Automatic linkbacks and my (unnamed) excerpting functions are pull technol...

Excerpt from BHDP: Trackback Threading at

Wordpress Trackback Validator Plugin

I just saw the WordPress Trackback Validator plugin fly by my aggregator and immediately installed it. I knew Dan online back in middle school, so with this endorsement, I installed it instantly: The Computer Security Lab at Rice just released the...

Excerpt from Matt Croydon::Postneo 2.0 at
