It’s just data

Markup at the crossroads

Don Box: A few months ago, Sam and I proposed the use of XHTML in RSS.  I can't speak for Sam's motivations, but I can speak for my own.

My inspiration was the Well Formed Web.


XHTML in RSS : pumping water down the gas pipes

Regular browsers are fine for viewing content, but they don't give us the full pleasure of a good aggregator. What...... [more]

Trackback from Raw Blog

at

RE: Markup at the crossroads

Sam,
As I told Don privately, I'm of two minds about xhtml:body in RSS feeds. On the one hand, it fits in nicely with the concept of the Well Formed Web and the benefits we'd all receive if we had XML Everywhere. On the other hand, both content producers and authoring tools have shown reluctance to adopt XHTML. Quite frankly, I think doubly encoded content in description and content:encoded elements is a gross hack.

However I'm a realist and have figured out how to make all three content forms look like well-formed XML for my needs so even if the world decided Tag Soup in RSS is the way to go I'll be covered. I just hope it doesn't.
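A minimal sketch of what handling the three content forms can look like, assuming Python's standard xml.etree.ElementTree. The sample items and both helper functions are hypothetical illustrations, not code from anyone in this thread. The escaped and CDATA forms need a second parse of the text, which only succeeds when the embedded markup happens to be well-formed; the inline form is reachable from the first parse.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Hypothetical items carrying the same sentence in three forms.
ESCAPED = (
    "<item><description>"
    "This is &lt;em&gt;important&lt;/em&gt;."
    "</description></item>"
)
CDATA = (
    '<item xmlns:content="%s"><content:encoded>'
    "<![CDATA[This is <em>important</em>.]]>"
    "</content:encoded></item>" % CONTENT_NS
)
INLINE = (
    '<item xmlns:xhtml="%s"><xhtml:body>'
    "This is <xhtml:em>important</xhtml:em>."
    "</xhtml:body></item>" % XHTML_NS
)


def emphasized_from_text(item_xml, tag):
    """Escaped or CDATA form: the markup arrives as text and is parsed again."""
    item = ET.fromstring(item_xml)
    markup = item.findtext(tag)  # the feed parser has already undone one layer
    # Second parse; raises ET.ParseError if the embedded markup is tag soup.
    fragment = ET.fromstring("<div>" + markup + "</div>")
    return [em.text for em in fragment.iter("em")]


def emphasized_from_inline(item_xml):
    """Inline XHTML form: the markup is already part of the parsed tree."""
    item = ET.fromstring(item_xml)
    return [em.text for em in item.iter("{%s}em" % XHTML_NS)]


print(emphasized_from_text(ESCAPED, "description"))
print(emphasized_from_text(CDATA, "{%s}encoded" % CONTENT_NS))
print(emphasized_from_inline(INLINE))
```

All three calls find the same emphasized word here, but only because the sample markup is well-formed. Escaped tag soup such as `<b>bold <i>both</b></i>` would fail at the second parse.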

As for Danny Ayers' post on why this is a bad idea, I'd suggest he learn how HTTP works, as well as why people use news aggregators, before criticizing this idea. A little research goes a long way.

Message from Dare Obasanjo at

Sam - I do think inline XHTML is a bad idea, and have yet to see any evidence that there will be any net gain from this approach. It looks like it's going to happen anyway, so I'd rather it was at least well worked out, and you and Don are probably the best guys for that.

Sure, I could be wrong about it, but stuffing things together like this gives me the same nerves I sometimes get when I try to take dodgy shortcuts in code. They usually work out to be a lot more work in the long run.

Dare - I am not aware of any practical means by which HTTP can directly allow the same detail and granularity of metadata you can put into a feed. I just reread what I'd written, thinking I'd painted with too broad strokes (that last-modified or something might provide the same functionality in the loose way I'd described), but actually I still think it's correct.

re. why people use aggregators - there is no need for there to be any difference to the end user whether the content is inline or at a separate location. If anything the latter would offer a better experience because potentially less bandwidth is required.

The Well Formed Web / XML Everywhere is a cute idea, but it doesn't explain how relationships between pieces of information will be conveyed. It seems to assume that a hierarchy will be enough to model everything. That's without considering the issue of the dirtiness of web data.

Posted by Danny at

RE: Markup at the crossroads

Danny,
You wrote
"If they are separate, then we can use information in the metadata feed to optimise the use of the content feed. If we've read all the items currently in the feed, we don't want to keep pulling the content every hour. We can tell this from the metadata alone."

This is not a problem. Look up HTTP 304 which most aggregators support.
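For reference, a rough sketch of the conditional GET behind HTTP 304, written as plain Python with no network access. The function names are hypothetical, and the sketch assumes the aggregator has stored the ETag and Last-Modified values from its previous fetch of the feed.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to skip an unchanged feed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def feed_after_poll(status, body, cached_body):
    """Pick the feed text to use after a poll.

    304 Not Modified arrives without a body, so the cached copy is reused.
    200 OK carries a fresh copy of the feed.
    """
    if status == 304:
        return cached_body
    if status == 200:
        return body
    raise ValueError("unexpected HTTP status: %d" % status)


headers = conditional_headers(
    etag='"abc123"',
    last_modified="Sat, 01 Jan 2000 00:00:00 GMT",
)
print(headers)
print(feed_after_poll(304, None, "<rss>cached</rss>"))
```

The ETag and date above are made-up sample values. The saving is that an unchanged feed costs one small header exchange per poll rather than a full download.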

As for the question of why people use RSS aggregators, the main reason I've seen many people give (myself included) is that they don't want to have to visit umpteen sites to get content, just one. You seem to be denying this and working against this very practice.

Message from Dare Obasanjo at

Let's say we've read all the items in someone's blog. They delete an item. We have still read all the items in their feed, yet the data has changed. HTTP 304 Not Modified would not be appropriate.

This is admittedly an edge case, but I think the general idea is sound. The more we know about the content, the better we can deal with it. A lightweight bit of metadata might be enough to tell us that the potentially heavyweight content isn't of interest to us.

If the metadata includes the URLs of the umpteen sites, the aggregator can get the content easily enough. It could even inline it on the fly, if that's what you really wanted. There is much difference in effort for the tool builder, the perception for the end user can be the same.

Posted by Danny at

Whoops - last sentence should read "There isn't much difference..."

Posted by Danny at

"The Well Formed Web / XML Everywhere is a cute idea, but it doesn't explain how relationships between pieces of information will be conveyed."

Pardon me while I wax philosophical, but the relationship between the pieces, aka the 'meaning', will be transmitted in the same way it has always been: by being observed by some thinking being. That restricts it to humans for now. Maybe one day we'll have self-aware AIs, but until then meaning is what happens when information hits a human brain.

Posted by joe at

Danny,

In reading your opinions on this, I think you are conflating several issues here.

1) Some people want "full" content in their feeds, others want "excerpt/teasers" in their feeds. I wish people only did the latter, putting the burden for fetching the whole enchilada on the offline-mode aggregators, but I'm not everyone.

2) Neither Sam nor I are advocating putting everything into the XHTML bucket. All that's being advocated is that people use "real" markup when they want to do things like em and strong or any other rich-text-isms people want to use.  And by using a well-understood vocabulary (XHTML), people who want to add richer markup WITHIN their content have a reasonable way to do so.

DB

Posted by Don Box at

Joe, take a look at Archipelago's RSS feed.  The permanent link is XML encoded.  The creation date is in there too...

I specifically parse out the link in my extractor function.  I am good at regular expressions.

But there really is no reason why things have to be this hard.  There should be something more to this well formed web than simply putting <![CDATA[ ]]> around everything and relying on wizardry to achieve even simple results.

Posted by Sam Ruby at

Maintenance

I've spread myself too thin.  Inspired by Tantek's "What to do with things to do", I have decided to prune.... [more]

Trackback from dive into mark

at

Don, I take your point. Two issues.

I can't argue with the use of real markup, as long as it really is real - the required agent behaviour is specified somewhere rather than the assumption being made that an aggregator will be a dumb viewer.
Until fairly recently the non-CDATA markup would have been a problem with RSS 1.0, but I think the introduction of rdf:XMLLiteral may cover this.

re. teaser/full content : it's not an easy one! I've probably got 100 feeds in Amphetadesk right now, and many of those are full-content. This makes reading the material harder than it need be (ok, part of this is due to Amphetadesk's GUI, but still). Having teasers would be much better for finding the bits I really did want to read. But probably like everybody else I would like to have the full content available to read immediately. This would require a little more work on the part of the tool builder where the content is separate.

Though these are two separate issues, making it easy to put rich content in a feed gives the green light to putting full web page-style content in there.

There are potential problems mostly with bandwidth - if a server hosts 1000 blogs and 1000 aggregators are polling these every hour, the difference between teasers and full content could be huge.
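A back-of-the-envelope version of that estimate, as a small Python sketch. The two feed sizes are assumptions picked purely for illustration, and the calculation assumes every aggregator polls every blog and downloads the whole feed each time (that is, no HTTP 304 responses).

```python
def daily_transfer_bytes(blogs, aggregators, polls_per_day, feed_bytes):
    """Bytes served per day if every poll downloads the entire feed."""
    return blogs * aggregators * polls_per_day * feed_bytes


# Assumed feed sizes, not measurements.
TEASER_FEED_BYTES = 5_000
FULL_FEED_BYTES = 50_000

# 1000 blogs, 1000 aggregators, one poll per hour.
teaser_total = daily_transfer_bytes(1000, 1000, 24, TEASER_FEED_BYTES)
full_total = daily_transfer_bytes(1000, 1000, 24, FULL_FEED_BYTES)

print(teaser_total / 1e9, "GB per day with teasers")
print(full_total / 1e9, "GB per day with full content")
```

Under these assumptions that is 24 million requests a day, about 120 GB with teasers against 1.2 TB with full content. The ratio simply tracks the ratio of the feed sizes, so the real gap depends on how long the posts are.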

I suppose this is one of those things that was/is bound to happen though. The concerted restraint required to do the meta/content split just wouldn't happen. Sticking a regular web page in RSS is just too easy. If there are problems further down the line then I suppose they'll get patched as well as convenient.

Posted by Danny at

Just noticed the comment from joe.
It doesn't need humans or self-aware AIs to convey a lot of meaning between separate systems, just a sophisticated framework for description such as RDF + OWL.

Posted by Danny at

Re teaser/full content: anyone who reads in offline mode (I'm one, for example) will appreciate having the full content available.

Today we do this by embedding full content in the actual feeds. It's just the lack of design in RSS that makes this an issue.

If each (item) in an RSS feed provides a URL for getting the full content in "pure" form (a single RSS item with full metadata, content:encoded, xhtml:body, etc.) it becomes trivial to get existing aggregators to do two-phase information download -- get a changelist, and then get the changed items. Think of it as a "real" permalink -- not to some HTML page, but to the item in its most "pure" form -- the RSS item with all its metadata.
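The two-phase download could be sketched roughly as follows. Everything here is hypothetical: the changelist is modelled as (item URL, version) pairs, where the version stands in for whatever change marker the lightweight feed would carry, and fetch_item stands in for an HTTP GET of the item's "pure" form.

```python
def two_phase_sync(changelist, cache, fetch_item):
    """Fetch only the items that are new or changed since the last sync.

    changelist: list of (item_url, version) pairs from the lightweight feed.
    cache:      dict mapping item_url -> (version, content); updated in place.
    fetch_item: callable that takes an item URL and returns its full content.
    Returns the list of URLs that were actually fetched.
    """
    fetched = []
    for url, version in changelist:
        cached = cache.get(url)
        if cached is None or cached[0] != version:
            # Phase two: pull the full item only when phase one says it changed.
            cache[url] = (version, fetch_item(url))
            fetched.append(url)
    return fetched


# Usage with a fake fetcher in place of the network.
cache = {"http://example.org/item/1": ("v1", "old text")}
changelist = [
    ("http://example.org/item/1", "v1"),  # unchanged, skipped
    ("http://example.org/item/2", "v1"),  # new, fetched
]
print(two_phase_sync(changelist, cache, lambda url: "content of " + url))
```

An offline reader would run both phases up front and store the results, while an online reader could defer the second phase until an item is opened.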

Posted by Ziv Caspi at

Markup at the crossroads - getting there from here

One possible path...

Excerpt from Don Box's Spoutlet at
