Persai Feedcorpus Status

Kyle Shank: I present to you the Persai feed corpus: 118,254 feeds of pure greatness.

Let’s check on the status of these URIs.

Results: persai_feedcorpus_status.zip.

Summary:

 Count  Status  Message
45,692  200     OK
     1  204     No Content
    57  300     Multiple Choices
42,569  301     Moved Permanently
 7,589  302     Found
    14  303     See Other
     7  304     Not Modified
   338  307     Temporary Redirect
    83  400     Bad Request
    95  401     Unauthorized
     1  402     Payment Required
   702  403     Forbidden
 1,437  404     Not Found
     9  406     Not Acceptable
16,284  408     Request Timeout
    45  410     Gone
     7  412     Precondition Failed
     3  423     Locked
 3,559  500     Internal Server Error
     4  502     Bad Gateway
    28  503     Service Unavailable


Bah. Over half of them are bogus. Not sure how that qualifies as “pure greatness”. For an AI company, this seems to be strong on one letter, and weak on the other...

Posted by Greg Stein at

301 is not bogus, it just means that the list does not reflect the currently preferred URI.  But if you look at the list itself, two sites are grossly overrepresented, and they account for the majority of the 301s and Timeouts.

If you exclude those two sites, the percentages are much better.  What’s left is a cross sample of the craziness that you find on the internet.  My favorite is 304.  Note that this is my first request to that particular URI...
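One way to see which sites dominate a list like this is to count URIs per host. What follows is a minimal sketch, assuming the corpus is available as a plain list of URI strings; the function name `top_hosts` and the sample URIs are made up for illustration and are not taken from the corpus.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_hosts(uris, n=2):
    """Return the n most common hostnames in uris as (host, count) pairs."""
    hosts = Counter()
    for uri in uris:
        host = urlsplit(uri).hostname  # lowercased; None when there is no host
        if host:
            hosts[host] += 1
    return hosts.most_common(n)


# Hypothetical sample, not taken from the actual corpus.
sample = [
    "http://blogs.example.com/alice/rss.xml",
    "http://blogs.example.com/bob/rss.xml",
    "http://blogs.example.com/carol/rss.xml",
    "http://news.example.org/feed",
]
```

Filtering the list on the hosts this returns, and then recomputing the status summary, would show how much those sites skew the percentages.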

Of course, even a 200 OK is no guarantee that the URI returns back a feed.  I’ve seen too many misconfigured sites that return back HTML with an OK...  Heck, I’ve seen a number of sites that return back a status code of 404... along with the data.
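A rough way to tell the two cases apart is to ignore the status code and look at the root element of the body. The sketch below is a heuristic under stated assumptions, not a validator: it only accepts well-formed XML, so it will reject many feeds that a liberal parser would still handle, and `looks_like_feed` is a name chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Root elements used by the common feed formats.
FEED_ROOTS = {
    "rss",                                               # RSS 0.9x / 2.0
    "{http://www.w3.org/2005/Atom}feed",                 # Atom 1.0
    "{http://purl.org/atom/ns#}feed",                    # Atom 0.3
    "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF",  # RSS 1.0 (or any other RDF)
}


def looks_like_feed(body):
    """Return True if body is well-formed XML whose root looks like a feed."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # HTML tag soup, truncated documents, empty bodies
    return root.tag in FEED_ROOTS
```

Note that the standard library parser is not hardened against maliciously constructed XML, so a real crawler would want limits on body size at the very least.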

Persai has not yet revealed what their backend AI does, other than it uses Apache Hadoop.  Getting a good set of input is a hard, and orthogonal, problem.  Bloglines and GoogleReader certainly have that data.  One solution (if they have access to the necessary servers to do it) is to set up their own Bloglines / GoogleReader system.  If it is attractive enough, the users will come to them, and bring their data with them.

Perhaps Venus could help bootstrap this effort.  It even can take care of a lot of the data cleansing needs.

Posted by Sam Ruby at

IMHO, a corpus that big is going to be only useful for performance testing and figuring out just how liberal your parser is going to need to be.  And in the case of performance testing at least, I don’t see the 301s as likely to be particularly problematic.

The 412 Precondition Failed responses are usually the result of overzealous security code, IIRC.  I wouldn’t be surprised if you were able to retrieve the feed with a browser, but not with the Ruby HTTP client as configured by default.
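To test that guess, one could retry the failing URIs with browser-like request headers and see whether the status changes. A small Python sketch follows; the User-Agent string and the helper name are hypothetical, and whether any given server responds differently is exactly what the experiment would have to show.

```python
import urllib.request

# Hypothetical browser-like User-Agent string, for illustration only.
BROWSER_UA = "Mozilla/5.0 (compatible; FeedStatusCheck/0.1)"


def browser_like_request(url, user_agent=BROWSER_UA):
    """Build a request that overrides the HTTP client's default User-Agent."""
    headers = {"User-Agent": user_agent, "Accept": "*/*"}
    return urllib.request.Request(url, headers=headers)
```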

Posted by Bob Aman at

Who was the one Payment Required?

The 408 is something that your crawler returned for socket timeouts?

If people just want a list of ‘popular’ feeds, we could likely get Bloglines to dump a list sorted by popularity. Lemme know if people are interested in it...

-Paul

Posted by Paul Querna at

Who was the one Payment Required?

rss.mac.com

The 408 is something that your crawler returned for socket timeouts?

Yes.
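For anyone wanting to repeat the experiment, the convention of recording a socket timeout as a 408 can be sketched as follows. This is a Python illustration, not the script that produced the results above. Redirects are deliberately not followed so that 301 and 302 show up in the tally, the ten second timeout is an arbitrary choice, and `open_url` is a parameter added here only so the function can be exercised without network access.

```python
import socket
import urllib.error
import urllib.request
from collections import Counter


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Decline to follow redirects so that the 3xx status itself is reported."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


_opener = urllib.request.build_opener(NoRedirect)


def fetch_status(url, timeout=10, open_url=_opener.open):
    """Return the HTTP status for url, reporting socket timeouts as 408."""
    try:
        with open_url(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # every non-2xx response ends up here
    except (socket.timeout, TimeoutError):
        return 408  # timed out while reading the response
    except urllib.error.URLError as err:
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return 408  # timed out while connecting
        raise  # DNS failures, refused connections: left to the caller


def tally(urls, **kwargs):
    """Count how many URIs produced each status code."""
    return Counter(fetch_status(url, **kwargs) for url in urls)
```

How to record failures that never produce a status at all, such as unresolvable hostnames, is left open here; the sketch simply re-raises them.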

Posted by Sam Ruby at

Wow.  I’m really not sure how any of you found our posting but this is the internet...  The feed corpus is just 2 days worth of work trying to build a list of unique feeds to seed our crawl with.  It is very immature and I was being sarcastic when I called it a piece of pure greatness ;)

Ideally we want to compile a list of all known RSS feeds.

@PaulQuerna: We don’t care about popularity.  We just want them all. :)

Posted by Kyle Shank at

We don’t care about popularity.  We just want them all.

There are an uncountably infinite number of them.  Where will you put them?

Posted by Mark at

We just want as many feeds as possible that are active and have content to offer.  It’s not a matter of storage, but of crawling/parsing/etc.

Posted by Kyle Shank at

Kyle, you might want to watch rpc.weblogs.com/changes.xml then.  The data is very active, but very dirty.  In many cases, all you get is the address of a website.  But if you do fetch that website, you often can find autodiscovery links in the response.
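Finding those autodiscovery links amounts to scanning the page for `link` elements with `rel="alternate"` and a feed media type. A minimal sketch using only the Python standard library; the class and function names are made up for illustration, and only the RSS and Atom media types are checked.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URIs advertised through <link rel="alternate"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        media_type = attr.get("type", "").strip().lower()
        if "alternate" in rels and media_type in FEED_TYPES and attr.get("href"):
            # href may be relative, so resolve it against the page address
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    """Return the absolute feed URIs advertised by an HTML page."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.feeds
```

Since `html.parser` is tolerant of malformed markup, this should cope with most of the dirty pages such a ping list points at, though it does not honor a `base` element in the page.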

Off-topic: in searching for that link, I came across yet another Feed Validator.  At the present time, Doc Searls’s feed is not well formed XML, but that “validator” declares it to be “a valid syndication Feed”.

Sigh.

Posted by Sam Ruby at

feed list quality

Sam Ruby ran an analysis of the Persai Feed Corpus, showing how many of each HTTP status code he got back when he requested each feed. Even after looking at the site, I don’t really know what Persai is, but I have some experience with long lists...

Excerpt from without an e at


Sam: I’ve looking into the various ping trackers and all of them are overridden with deceptive feeds and spam.  Take a look at the changes.xml for a long list urls and a majority appeared to be spam.

Posted by Kyle Shank at

What a pity.

Here’s a better source then: planet opml files.  If you feel so inclined, here are a few more feeds — I’ve blocked planet intertwingly from being indexed via robots.txt, but you are welcome to include the feeds in your list.

Posted by Sam Ruby at

Thanks for the suggestion Sam!  This list already includes a crawl of the top 1000 opml feeds returned from Google.  The planet query should yield many more.

pardon my grammar in the previous comment, it’s gotta be the coffee :)

Posted by Kyle Shank at

If people just want a list of ‘popular’ feeds, we could likely get Bloglines to dump a list sorted by popularity. Lemme know if people are interested in it...

I would kill for something like this if it were large enough (say 20,000+ which is the size of my current test list). I wouldn’t say no to a smaller list either. I do a lot of interoperability testing and a good, representative source of feeds is hard to come by. I’m mostly interested in RSS 2.0 feeds (for RSS Advisory Board work), but everything is good.

Posted by James Holderness at


Giant list of feed URLs

[link]...

Excerpt from del.icio.us/gary.bernhardt at


What doesn't clog your algo makes it stronger...

Valleywag outed the startup day job of the guys who collectively edit the hilarious snark site uncov. The startup, Persai, was “hiding in plain site” since they have a blog and have been pretty open about the tech......

Excerpt from Skrentablog at

An analysis of the Persai feed corpus.

The comments provide links to other feed lists/corpora...

Excerpt from del.icio.us/fitzgeraldsteele at


Persai’s RSS Crawl and Topix

Looks like Rich was playing with the Persai tar.gz web crawl they posted the other day. I got a sinking feeling as I read this. I had curl’d over the corpus already to eyeball it …yeah that’s a list of feeds all right… but...

Excerpt from Kevin Burton's NEW FeedBlog at


First Impressions: Persai

“Blogging Persai” is the title of the blog run by the Persai guys. If you needed an indication of how this post is going to proceed, a major hint would be that I was sorely tempted to give the title “Flogging Persai” to it....

Excerpt from Blue Screen Of Duds at
