It’s just data

Another Month

Jennifer Michelstein: <title>Academic features: citation &amp;amp; bibliography tools</title>

Deja Vu.

This problem is important to me because truth be told, specs matter, but only so far as they are followed.  For years, RSS had a validator that happily accepted feeds which were not even well formed XML.  We are still digging out from under the mess that that created.
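To make "well formed" concrete, here is a minimal sketch of the check a validator could run before anything else. It uses Python's standard library parser; the helper name `is_well_formed` and the sample feeds are illustrative assumptions, not taken from any real validator.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The ampersand is escaped once, as XML requires.
good = "<rss><channel><title>citation &amp; bibliography</title></channel></rss>"
# A bare ampersand is not well-formed XML.
bad = "<rss><channel><title>citation & bibliography</title></channel></rss>"

print(is_well_formed(good), is_well_formed(bad))
```

A feed that fails this check is not XML at all, so nothing else in the spec can be evaluated against it.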

The initial RSS specs were clear that titles were to be singly encoded, and no subsequent specification gave license to putting escaped HTML in titles.  The RSS-Profile once again makes this clear.  IE7, the Microsoft RSS platform, and Mozilla based products all expect titles to be singly encoded.  Yet this practice is not universal: some products’ approach to standards is not to follow what the specs say or what the consuming tools do, but simply to document how they deviate from the specs.
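As a rough illustration of the difference, the sketch below parses the title quoted at the top of this post once as written (double encoded) and once singly encoded. The variable and function names are made up for the example, and `html.unescape` stands in for whatever second decoding pass a double-escaping consumer applies.

```python
import html
import xml.etree.ElementTree as ET

single = "<title>Academic features: citation &amp; bibliography tools</title>"
double = "<title>Academic features: citation &amp;amp; bibliography tools</title>"


def title_text(xml_fragment: str) -> str:
    """Return the title's character data after the XML parser's one decoding pass."""
    return ET.fromstring(xml_fragment).text


single_text = title_text(single)  # the ampersand comes out as a plain "&"
double_text = title_text(double)  # a literal "&amp;" survives the XML parse

# A consumer that expects double encoding applies a second, HTML-level pass.
recovered = html.unescape(double_text)

print(single_text)
print(double_text)
print(recovered)
```

A consumer that expects single encoding shows the second title with a visible "&amp;" in it; a consumer that expects double encoding has to guess whether the second pass is safe to apply.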

If that’s how the RSS 2.0 world wants to proceed, that’s fine with me, but I would prefer that everybody who wants to produce such sloppy feeds stop providing them in Atom format.  With Atom you are free to choose between three different ways to encode this (and I've personally seen all three in RSS 2.0 feeds); your only responsibility is to correctly indicate which one you chose.
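For reference, the three choices are the `type` values "text" (the default), "html", and "xhtml" defined by the Atom format. The following is a sketch of how a producer might emit each one; the function names are hypothetical, the Atom namespace declaration on the enclosing feed is omitted for brevity, and the XHTML variant assumes its input is already well-formed markup.

```python
from xml.sax.saxutils import escape


def atom_title_text(plain: str) -> str:
    """type="text" is the default: plain text, escaped once for XML."""
    return f"<title>{escape(plain)}</title>"


def atom_title_html(html_markup: str) -> str:
    """type="html": the content is HTML source, escaped once more for XML."""
    return f'<title type="html">{escape(html_markup)}</title>'


def atom_title_xhtml(xhtml_markup: str) -> str:
    """type="xhtml": inline markup inside an XHTML div (assumed well-formed)."""
    return (
        '<title type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">'
        f"{xhtml_markup}</div></title>"
    )


print(atom_title_text("citation & bibliography"))
print(atom_title_html("citation &amp; bibliography"))
print(atom_title_xhtml("citation &amp; bibliography"))
```

All three outputs describe the same displayed title. Only the second contains "&amp;amp;", and it says so by carrying type="html".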

Is that too much to ask?


Sam, tried out EarthLink’s reader yet?  I can confirm it doesn’t like your feed too much, but in different ways than I would have expected.  To be honest, I’m really not sure how it comes up with the results that it does.

Posted by Bob Aman at

On the subject of RSS titles: listing IE7, the Microsoft RSS platform and Mozilla as three separate products supporting single encoding is a bit of a stretch. Let’s face it, Mozilla hasn’t so much committed to single encoding as they have committed to following whatever IE7 chooses to do. And IE7 is just using the Windows RSS platform so that hardly counts.

If Microsoft were to change their stance tomorrow (which isn’t unthinkable - they haven’t actually committed to anything) then Mozilla would soon follow and you’d struggle to find a single major aggregator that supported single encoding. Right now the aggregator market is essentially split between Microsoft+Mozilla and everyone else. That’s not an easy decision for feed producers to make, but since both MS and Mozilla have not actually released their products yet, I don’t blame anyone for choosing “everyone else” (at least for now).

Posted by James Holderness at

To be honest, I’m really not sure how it comes up with the results that it does.

It ignores atom:content, it picks up atom:summary.  It picks up atom:title, but ignores atom:updated.  It doesn’t seem to like relative URIs at all.

What concerns me most, however, is that it doesn’t seem to be able to handle utf-8.

Count me as unimpressed.

Posted by Sam Ruby at

On the subject of RSS titles: listing IE7, the Microsoft RSS platform and Mozilla as three separate products supporting single encoding is a bit of a stretch.

Somewhere along the line I got the impression that IE7 and the MRSS platform were separate implementations.  Based on a quick scan, it looks like I was wrong.  Perhaps I should have said RSSOwl.

Prior, more comprehensive tests showed that while some consumers “supported” some types of double escaping, different consumers supported different variations, and concluded that producers should stick with single escaping.

In any case, my intention was not to belabor the mess that RSS 2.0 titles have become, but rather to use that as an example of something I would rather not see Atom emulate.

If you want to double escape titles in Atom feeds, by all means feel free.  Simply add type="html" to the title element.  That shouldn’t be too complicated, or take months and months to implement and deploy.

What I don’t want to see is consumers like Snarfer ignoring the default value for title attributes because “that’s just what the spec says, and here’s an example Microsoft site that does not follow the spec”.
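A consumer that honors the declared type, and the "text" default when none is declared, might look roughly like the sketch below. The function name `displayed_title` is an assumption for illustration, and the "html" branch only resolves entities; a real consumer would also have to parse or strip any tags.

```python
import html
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def displayed_title(entry_xml: str) -> str:
    """Return the text a reader should see, based on the title's type attribute."""
    title = ET.fromstring(entry_xml).find(ATOM + "title")
    kind = title.get("type", "text")  # an absent attribute means "text"
    if kind == "html":
        # Simplification: resolve entities only, assuming no HTML tags.
        return html.unescape(title.text or "")
    if kind == "xhtml":
        # Concatenate the text found inside the XHTML div.
        return "".join(title.itertext())
    return title.text or ""


NS = 'xmlns="http://www.w3.org/2005/Atom"'
as_text = f"<entry {NS}><title>citation &amp;amp; bibliography</title></entry>"
as_html = f'<entry {NS}><title type="html">citation &amp;amp; bibliography</title></entry>'

print(displayed_title(as_text))
print(displayed_title(as_html))
```

The same bytes inside the title element yield two different displayed strings, and the attribute (or its absence) is the only thing that says which one the producer meant.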

Posted by Sam Ruby at

Mozilla hasn’t so much committed to single encoding as they have committed to

Mozilla hasn’t committed to anything. I have my opinion, and I like to think it counts for a lot on this particular issue, but the technical buck does not stop with me.

Posted by Robert Sayre at

Prior, more comprehensive tests showed that while some consumers “supported” some types of double escaping, different consumers supported different variations, and concluded that producers should stick with single escaping.

Actually those tests were a lot less comprehensive. What Rogers was testing, and I was following up on, was how to escape a simple hi-ascii character. There’s no need for double escaping because there’s no chance of confusion once it’s been unescaped (it’s not an ampersand or angle bracket). Hell you don’t even need to escape it once - just UTF-8 encode it. It was a pointless exercise really.
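To illustrate that point with a conforming XML parser (a sketch only; the sample character and variable names are arbitrary choices), a numeric character reference and the raw UTF-8 bytes for the same character parse to identical text:

```python
import xml.etree.ElementTree as ET

# The same title written two ways: as a numeric character reference,
# and as the character itself encoded in UTF-8 (the XML default encoding).
escaped_once = b"<title>caf&#233;</title>"
raw_utf8 = "<title>caf\u00e9</title>".encode("utf-8")

from_escaped = ET.fromstring(escaped_once).text
from_raw = ET.fromstring(raw_utf8).text

print(from_escaped == from_raw)
```

This only shows what a spec-following parser does; it says nothing about aggregators that handle the bytes some other way.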

My more recent tests were far more comprehensive. 175 tests and I tested every one of them on every aggregator I had. Results do vary from aggregator to aggregator - there’s no shortage of bugs and the double escaping crowd also use different levels of heuristics to cope with single escaping. However, other than IE7, Firefox and RSSOwl, I think it’s fairly safe to classify every other aggregator (that I’ve tested) in the double escaping group. That includes Bloglines, Newsgator, Google Reader, Attensa, My Yahoo!, FeedDemon, RSS Bandit, Sharpreader, etc., etc. Half of these companies have representatives on the RSS Advisory Board and they haven’t uttered a peep on the subject. I’d be surprised if they’ve even read the RSS profile.

In any case, my intention was not to belabor the mess that RSS 2.0 titles have become, but rather to use that as an example of something I would rather not see Atom emulate.

Yeah, sorry, I’m doing it again now. I agree completely regarding Atom. But I still need to deal with RSS and I find the current situation extremely frustrating.

What I don’t want to see is consumers like Snarfer ignoring the default value for title attributes because “that’s just what the spec says, and here’s an example Microsoft site that does not follow the spec”.

Believe me I would endeavour to avoid that at all costs. My one exception so far has been for an Atom 0.3 issue which wasn’t entirely clear, every other aggregator was misinterpreting it in the same way, and it seemed an impossible task to try and get the rest of the world to comply with a spec that officially doesn’t exist.

Posted by James Holderness at

Hell you don’t even need to escape it once - just UTF-8 encode it.

Not if you want it to display properly in Radio Userland’s aggregator.  Or in EarthLink’s reader.

Once you leave the land where specs matter and enter a place where everybody does their own thing and it is up to you to figure out how to best fit in, things get complicated quickly.

I want to return to a simpler place.

Posted by Sam Ruby at

Hell you don’t even need to escape it once - just UTF-8 encode it.

Not if you want it to display properly in Radio Userland’s aggregator.  Or in EarthLink’s reader.

Fair enough. But there’s a slight difference between buggy code and a deliberate implementation choice that has been made for interop reasons. With the former, they have an incentive to fix their bugs because it will make things better for their users. With the latter, they have zero incentive to “fix” their implementation because doing so will make things worse for their users.

I want to return to a simpler place.

I would think everyone wants that. We just don’t agree on how to get there.

Posted by James Holderness at

But there’s a slight difference between buggy code and a deliberate implementation choice that has been made for interop reasons.

I guess the only thing left to say is that “you must be new here”.

Five years ago, most of the RSS 0.92 feeds had no charset or encoding specified, and most tended to be encoded as win-1252.  This is how the tools that produced such feeds formatted them.  It is how the feed consuming tools that accompanied such feed producing tools expected things to be.  And the validator at the time gave such feeds nice shiny badges.
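A small sketch of why such feeds are a problem for a spec-following consumer: XML without an encoding declaration is read as UTF-8, so win-1252 bytes either fail to parse or have to be guessed at. The sample title and the helper name `parse_title` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A title containing one non-ASCII character, written as win-1252 bytes
# with no XML declaration and therefore no stated encoding.
raw = "<title>caf\u00e9</title>".encode("cp1252")


def parse_title(data: bytes):
    """Parse as XML's default (UTF-8); return None if not well formed."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


strict = parse_title(raw)  # the byte 0xE9 is not valid UTF-8 here

# What a consumer from that era effectively did: assume win-1252.
guessed = ET.fromstring(raw.decode("cp1252")).text

print(strict, guessed)
```

The guess works only as long as every producer happens to use the same undeclared encoding.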

And the advice at the time given to those that wanted to interop was that they must do RSS exactly as UserLand does.

Posted by Sam Ruby at

I saw that Earthlink was ignoring content and taking summaries, but it looked like they were trying to get relative URIs to work in some cases, but not others.  That was the main point of confusion.

Posted by Bob Aman at

