It’s just data

Bloglines Rocks!

I’ve given Bloglines a fair amount of grief over the past few months over their pathetic-at-the-time handling of Atom feeds.  I’m not ego-centric enough to believe that I got them to change – at most, I may have increased awareness of the issue to the point that the Bloglines team got around to addressing the issue slightly earlier than they would have otherwise.

But address the issue, they did.  I’m pleased to say, they flattened it.  They nailed it.

I posed four issues, and one extra credit problem.  They were posted on a Saturday.  By Monday, three were fixed.  By week’s end, all four were fixed.  Since that time, perhaps another half a dozen issues were noticed.  Each was resolved, with the fix not only implemented but deployed, within one business day of when it was reported.  So while the Bloglines team didn’t get the extra credit problem I posed (which, given the current implementations of browsers, would have required that they convert all content into consistently well-formed XHTML, and even then wouldn’t have worked on the dominant browser in use today), they clearly aced the test and found other ways to excel even more.

So from me to the Bloglines development team: YOU GUYS ROCK!

Speculation

Apparently, Bloglines' parent company has a policy which severely curtails the amount of public discussion that Bloglines developers may engage in.  From my perspective, such policies are unfortunate, but at the end of the day, that decision is theirs to make.  What follows is my interpretation of the publicly available information.  I may have some of the details wrong, but I’m confident that I have the broad brush strokes essentially correct.

The only charitable way to put it is that the current parser inside of Bloglines evolved over time.  It had (and has) to deal not only with multiple, incompatible, and underspecified specifications, but also with multiple, incompatible, and often non-compliant implementations of these specifications.  Previously, I’ve cataloged a few of the most common errors.

Along the way, the Bloglines aggregator has become a part of the feed eco-system.  Simply put, many people design their feeds not according to any specification, but rather to make sure they work with Bloglines.  To be clear, Bloglines is not unique in this regard: others do the same with NetNewsWire, and I expect many to do the same with IE7.

The inevitable result is calcification.  Software that was once, well, soft and pliable, has since become something you daren’t touch as you risk affecting the way that untold millions of feeds are interpreted.  You don’t touch it even if it means that Bloglines differs from either the specifications themselves or the way tools like NNW or IE7 handle these same edge cases.

With Atom, Bloglines has decided to pursue a fresh beginning.  A beginning free from the tyranny of the past.  The new status quo is that if you have a test case which is based on real world usage, and can point to the section of the spec which indicates how this test is to be interpreted, then the Bloglines development team will not only address the issue forthwith, but they will also add a test case to their regression test suite so that the same issue will never reoccur.

A clear spec + a regression test suite + Red/Green/Refactor =>

a parser which is not only maintainable,

but also one that can remain so indefinitely.
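
As a rough illustration of that workflow, here is a minimal sketch in Python of what a spec-anchored regression suite could look like. Everything named here (`title_to_html`, `entry`, `REGRESSION_CASES`, `run_regression_suite`) is hypothetical and is not Bloglines' or the Feed Parser's code. It only handles `type="text"` and `type="html"` titles, and the section numbers refer to RFC 4287.

```python
import html
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def title_to_html(entry_xml):
    """Return an Atom entry's title as an HTML fragment (hypothetical helper)."""
    title = ET.fromstring(entry_xml).find(ATOM_NS + "title")
    kind = title.get("type", "text")  # "text" is the default type
    text = title.text or ""
    if kind == "text":
        # Plain text: markup characters are literal and must be escaped for display.
        return html.escape(text, quote=False)
    if kind == "html":
        # Escaped HTML: the XML parser has already undone one level of escaping.
        return text
    raise ValueError("title type not handled in this sketch: " + kind)


def entry(title_element):
    """Wrap a title element in a minimal Atom entry (test fixture helper)."""
    return '<entry xmlns="http://www.w3.org/2005/Atom">' + title_element + "</entry>"


# Each case records the spec section it is based on, the input, and the expected output.
REGRESSION_CASES = [
    ("RFC 4287 3.1.1.1", entry("<title>&lt;b&gt;bold&lt;/b&gt;</title>"),
     "&lt;b&gt;bold&lt;/b&gt;"),
    ("RFC 4287 3.1.1.1", entry('<title type="text">AT&amp;T</title>'),
     "AT&amp;T"),
    ("RFC 4287 3.1.1.2", entry('<title type="html">&lt;b&gt;bold&lt;/b&gt;</title>'),
     "<b>bold</b>"),
]


def run_regression_suite():
    """Run every recorded case and return a list of (section, expected, actual) failures."""
    failures = []
    for section, xml, expected in REGRESSION_CASES:
        actual = title_to_html(xml)
        if actual != expected:
            failures.append((section, expected, actual))
    return failures


if __name__ == "__main__":
    for section, expected, actual in run_regression_suite():
        print("FAIL", section, "expected", repr(expected), "got", repr(actual))
```

The point of keeping the spec section next to each case is that a newly reported issue becomes one more tuple in the list, and the reason for the expected value stays traceable.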

One Request

This represents an important step forward.  Given that I can pass around URIs showing how Bloglines interprets any given public feed, URIs that anybody can view registration free and from any browser by any manufacturer [example], Bloglines – perhaps unwittingly – has found itself in the position of being a de facto reference implementation for feed specifications.

This is exciting, and frankly, a little scary.  But as long as Bloglines keeps their commitment to conformance, I’m confident that it will all work out.

But I do have one request.  Clearly, everyone has equal access to the specification.  And they also have equal access to the Feed Parser Tests.  And a member of the Bloglines development team has already publicly stated that they based many of their tests on the FeedParser tests, which is very, very cool.

My one request — and this is not just for Bloglines, but rather for everyone who may have benefited, either directly or indirectly, from the Feed Parser test suite:  if in the process of your development you have identified some additional tests, please consider donating as many as you can back so that everybody can benefit.


now all they need to do is nail performance.

Posted by James Governor at

While I was attending Gnomedex, I spent an hour during one of the parties discussing Bloglines with a couple of people from the company. They were genuinely interested in what they could do to make the product better. I was a dedicated user before, but now feel like a genuine supporter.

Posted by Michael Pate at

I’ve tried Google Reader, Netvibes, Sage (for Firefox) and many other feed readers and finally settled on Bloglines. The Ajax is not too intrusive; it is simply done right.

I’m fairly happy with Bloglines. Once in a while some feeds I have already read are still marked as unread; I don’t know why, but it’s not a big problem anyway.

Keep up the great work Bloglines team.

Posted by michele at

I dunno, James. It’s always been snappy for me.

Now that they’ve got Atom down, it’d be good to see them work on some of the remaining flaws in their RSS parser. I think they still have some problems with the way it normalises HTML.

Aside: why’s the spellchecker complaining about my Rightpondian spelling of the word 'normalise'?

Posted by Keith Gaughan at

Sam Ruby: Bloglines Rocks!

[link]...

Excerpt from del.icio.us/miyagawa at

I like the way Google Reader also interprets any given public feed that anybody can view registration free as well.  They give a preview of how it looks in Reader and then a view of the feed they produce which can be used anywhere, I think.  Their URLs seem to outline their API, which I’m finding interesting enough to use to build yet another feedreader.  (Common pet project? Yes.)  However their titles in Atom still don’t entirely match your conformance tests, though, I think.

Posted by Robin Thomas at

Nice. Perhaps I can stop calling them buglines. It seems I can stick with them now.

Posted by Darryl at

I like the way Google Reader also interprets any given public feed that anybody can view registration free as well.  They give a preview of how it looks in Reader

Robin: THANKS!  I missed that before.

and then a view of the feed they produce which can be used anywhere, I think

¡Ay, Dios Mio!  What happened to all my lovely well formed XHTML!  ;-)

I’ve got a half written blog entry that details some of the eye opening experiences I had trying to pummel the UFP, BeautifulSoup, and Planet into supporting MathML and SVG.  One of these days, I’ll finish it.

Otherwise, they don’t do a half bad job, though it is unfortunate that they displace the real ids with their own (stashing the original into an attribute).  They also source the entries to the feed that they got it from, and not to the original source.

However their titles in Atom still don’t entirely match your conformance tests, though, I think.

Google’s Reader is known to not do a very good job on handling Atom Titles.  In fact, they fail all of the title tests.

(But I have noticed that things that are mentioned in this weblog do tend to get fixed ;-))

Posted by Sam Ruby at

There’s a couple extra test cases among FeedTools' set that aren’t in the Feed Parser list.  They’re pretty easy to spot since they tend to be coded as heredocs.  Feel free to take and use whatever you want.  Not sure what’s actually useful to you guys though.

Posted by Bob Aman at

the eye opening experiences I had trying to pummel the UFP, BeautifulSoup, and Planet into supporting MathML and SVG. 

It turns out that, within the Mozilla Toolkit parser, I seem to be able to append tagsoup HTML DOM fragments to XHTML nodes. This makes it possible to support MathML and SVG, since Atom XHTML content stays in the XML content flow... got a whitelist for MathML and SVG?

Posted by Robert Sayre at

got a whitelist for MathML and SVG?

As I mentioned, I’ve got a bunch of experiences that I should take the time to write up, but yes, I do have a whitelist for each.  Check out the latest feedparser from CVS, and look for mathml_elements, mathml_attributes, svg_elements, svg_attributes, and svg_properties (the last one being CSS properties).

Things to look out for: xlink address space; case sensitivity in svg element and attribute names (I only handle the latter so far); and the fact that in some SVG documents, style attributes are fairly essential, so you want to sanitize these too (see my heuristics there too, feedback welcome).
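
A whitelist pass of the kind described here might look roughly like the following sketch. It is not the feedparser code: the sets below are small illustrative subsets rather than the real `svg_elements` and `svg_attributes` lists, the `sanitize` and `split_qname` helpers are hypothetical, style attributes are simply dropped rather than sanitized, and allowing only same-document `xlink:href` references is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Illustrative subsets only; names are compared case-sensitively, as XML requires.
SVG_ELEMENTS = {"svg", "g", "defs", "path", "rect", "circle", "use", "linearGradient"}
SVG_ATTRIBUTES = {"d", "fill", "stroke", "width", "height", "x", "y",
                  "cx", "cy", "r", "id", "viewBox", "transform"}


def split_qname(name):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if name.startswith("{"):
        namespace, local = name[1:].split("}", 1)
        return namespace, local
    return "", name


def attribute_allowed(name, value):
    namespace, local = split_qname(name)
    if namespace == "":
        return local in SVG_ATTRIBUTES
    if namespace == XLINK_NS and local == "href":
        # Assumption for this sketch: only same-document references are kept.
        return value.startswith("#")
    return False


def sanitize(element):
    """Strip non-whitelisted descendants and attributes in place.

    The element passed in is assumed to be acceptable itself; only its
    attributes and descendants are checked.
    """
    for child in list(element):
        namespace, local = split_qname(child.tag)
        if namespace != SVG_NS or local not in SVG_ELEMENTS:
            # remove() also drops the child's tail text, which is fine here.
            element.remove(child)
        else:
            sanitize(child)
    for name in list(element.attrib):
        if not attribute_allowed(name, element.attrib[name]):
            del element.attrib[name]
```

For example, parsing an `<svg>` fragment with `ET.fromstring` and passing the root to `sanitize` removes a `<script>` child and any `onload` or `onclick` attributes while leaving whitelisted shapes and their geometry untouched.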

I also had problems that you won’t likely run into: for example an SGML parser has no concept of empty-elements, and tools like Inkscape often produce empty <svg:defs>, and various filters within feedparser (like the one that resolves relative references) make use of sgmllib.  Since <svg:g> elements can exist both within and outside of an <svg:defs> element, this is pretty much fatal.

Also to my surprise, a number of patches I attempted to submit to the SGML parser (like basic Unicode support) were rejected as they weren’t valid SGML. 

Look in the tests/(well|ill)formed/namespace directory for *svg* and *mathml* tests.  They are rather minimal at the moment, but cover the above cases.  At the moment, the feedparser doesn’t convert valid html into xhtml which is invalid as html, so there are a few cases where xhtml:body in RSS 2.0 and type="xhtml" in Atom will produce better results than svg included in RSS 2.0 description or Atom content of type="html".

So my ¡Ay, Dios Mio! isn’t totally facetious.

Posted by Sam Ruby at

Sam: “Look in the tests/(well|ill)formed/namespace directory for svg and mathml tests.  They are rather minimal at the moment, but cover the above cases.”

Unfortunately, they’re also not atom :-(  (e.g. <atom xmlns="http://www.w3.org/2005/Atom">...<content><div ...)

Posted by James Snell at

Unfortunately, they’re also not atom

Oops!  Fixed.

Posted by Sam Ruby at

in some SVG documents, style attributes are fairly essential, so you want to sanitize these too

Ugh, the fallout of invention by specification. That pushes feed SVG support into 2007.

a number of patches I attempted to submit to the SGML parser ... were rejected as they weren’t valid SGML.

HTML5: While the HTML form of HTML5 bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules... few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators...

Posted by Robert Sayre at

Ironically, the “clear spec” referenced above “cannot be found.” The actual document is at [link]

Cheers :)

Posted by Jeremy Voorhis at

Ugh, the fallout of invention by specification.

I disagree.  The image at the top of this entry’s page has a single “float:right”.  SVG also participates in the DOM, so it can be scriptable by applications like GreaseMonkey.

That pushes feed SVG support into 2007.

To be clear, the elements and attributes I selected were the ones defined by the Tiny SVG profile which does not include style.  I then added elements and attributes that I saw that were heavily used by SVG files I could find on places like WikiPedia.

the only user agents to strictly handle HTML as an SGML application have historically been validators

A true statement, but a very misleading one.  Take a look at the value of charref and the implementation of handle_charref in Python’s sgmllib.py.

The UFP overrides both.  The latter is per the design.  The former affects all users of sgmllib.  BeautifulSoup does the same thing, and actually “borrowed” the code from some version of the UFP.  Of course, that code changed, so what the UFP actually supports at runtime depends on the order in which modules were imported.

Submitting patches to fix this will get rejected.
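
The import-order dependence is easier to see in a toy form. The sketch below does not use sgmllib, the UFP, or BeautifulSoup; `shared_parser` stands in for a shared module with a module-level regex, and the two `import_library_*` functions are hypothetical stand-ins for libraries that each replace that regex when they are imported.

```python
import re
import types

# Stand-in for a shared stdlib module that exposes a module-level regex.
shared_parser = types.SimpleNamespace(charref=re.compile(r"&#([0-9]+);"))


def find_charrefs(text):
    # The pattern is looked up at call time, so whoever patched last wins.
    return shared_parser.charref.findall(text)


def import_library_a():
    """Simulates library A's import-time patch: decimal or lowercase-x hex."""
    shared_parser.charref = re.compile(r"&#([0-9]+|x[0-9a-fA-F]+);")


def import_library_b():
    """Simulates library B's import-time patch: also accepts an uppercase X."""
    shared_parser.charref = re.compile(r"&#([0-9]+|[xX][0-9a-fA-F]+);")


def charrefs_after(*imports):
    """Apply the simulated imports in order, then parse the same input."""
    for do_import in imports:
        do_import()
    return find_charrefs("&#65; &#x42; &#X43;")
```

Calling `charrefs_after(import_library_a, import_library_b)` and `charrefs_after(import_library_b, import_library_a)` gives different results for the same input, because each library silently replaces the other's patch for every user of the shared module.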

Oh, and the workarounds that the UFP currently applies won’t all work with Python 2.5.  I submitted a number of patches which did get accepted which should make it possible for the UFP to introduce additional workarounds for Python 2.5’s implementation.  But somehow I suspect that the long term solution is for the UFP to fork the parser.

In any case, it is becoming clear to me that many of those that claim that they produce valid HTML/4.0 are really saying that they conform to the subset of HTML that the W3C validator has chosen to selectively support.

Posted by Sam Ruby at

the “clear spec” referenced above “cannot be found.”

Fixed.  Thanks!

Posted by Sam Ruby at

I then added elements and attributes that I saw that were heavily used by SVG files I could find on places like WikiPedia.

Moron!

Posted by Mark at

Moron!

Ahem.

I’ll have you know that I read and implemented the SVG Tiny Profile.  (Though perhaps I should mention that I have declined so far to put foreignObject on the whitelist).  Then I have been gradually expanding the whitelist, time permitting, prioritizing my efforts based on actual usage.

And yes, I can pretty much rationalize anything.  ;-)

Posted by Sam Ruby at

<img>

Posted by anonymous at

David Pashley: Strict feed parsers are useless

Erich, I’m not entirely sure what you did to break Planet, but using a strict feed parser will just result in you missing a significant number of entries. People sadly don’t produce valid feeds and will blame your software rather than their feeds....

Excerpt from Planet Lupin at

Strict feed parsers are useless

Erich, I’m not entirely sure what you did to break Planet, but using a strict feed parser will just result in you missing a significant number of entries. People sadly don’t produce valid feeds and will blame your software rather than their feeds....

Excerpt from JD at

The inevitable result is calcification.  Software that was once, well, soft and pliable, has since become something you daren’t touch as you risk affecting the way that untold millions of feeds are interpreted.

My term of choice is Technical Arteriosclerosis. And I agree with the diagnosis, the speculation on the causes, and the pragmatic solution.



Posted by koranteng Ofosu-Amaah at

Sam Ruby: Bloglines Rocks!

Maybe I will switch back: Sam Ruby: Bloglines... [more]

Trackback from 42 at

Women in my newsreader

always read Tara Calishain Susan Mernit sometimes read Charlene Li Angela Beesley compare this to around 200 feeds in bloglines (which by the way, has gotten their act together). hmn, maybe I should resubscribe to Marry Hodder, I don’t quite...

Excerpt from Puzzlepieces at

Sam Ruby: Bloglines Rocks!

Maybe I will switch back: Sam Ruby: Bloglines Rocks!.......

Excerpt from 42 at

News from around the web

News worth reading: 3 posts from Shelley Powers: At one point the W3C was suggesting that people put dates directly into their URLs. That bad idea has finally died. At another time, there was a feeling that every project you did deserved its own...

Excerpt from Category 4 Blog at

RSS has been damaged by in-fighting among those who have developed it

Syndication feeds such as RSS and Atom have the power to automate the delivery of all forms of digital content. The word “content” can refer to weblog posts or MP3s, the U.S. President’s last speech or the photos of your last...

Excerpt from Category 4 Blog at

Bloglines Rocks !

“...a fresh beginning for Bloglines”...

Excerpt from Public marks with tag bloglines at

Hi guys.  Thank you for the comments and feedback.  If you like the Classic version of Bloglines, and have not yet tried our new Bloglines Beta, please visit [link] and sign in using your current Bloglines username and password.  Feel free to leave us feedback!

Christian from the Bloglines team

Posted by Christian at
