This page is spun off from AggregatorBehaviorRules. Moot or archived discussions can go in AggregatorApiArchived.


The purpose of AggregatorApi is to provide more efficient delivery of Necho/Atom data to clients. In particular, only changes will be delivered along the channel.

Functionally, the end result should be at least as powerful as polling "flat" XML files.

Getting up to speed with this page/where we are

This section will contain a summary of issues that have been resolved further down the page.

Things that we have agreed upon so far

Things that we have mostly agreed upon

Things that we are working on

Goals / Design Points

Issues to raise with respect to editing

These should bring to light other important things that are being swept under the rug at present, such as:

Aggregator use cases

We need a list of scenarios that we're trying to encompass in order to be able to accurately define the API operations we'll need to complete this work. Throughout we will ignore comments, trackbacks and so forth, since work is ongoing elsewhere to combine them with entries in the ConceptualModel.

Aggregator <-> SimpleProducer use cases

Efficient update of feed knowledge

An aggregator has an existing copy of a feed, which is probably out of date. We want to update the feed in a bandwidth-efficient fashion, ideally without being too processor-intensive.

(Note: to my knowledge, no one is currently putting forward a concrete proposal for an expiry mechanism. RSS has some solutions, and HTTP has an expiry mechanism which can give hints but isn't really designed for consuming user agents. The requirement is noted in AggregatorBehaviorRules.)
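As a concrete baseline, plain HTTP already supports a bandwidth-efficient check via conditional GET, which aggregators can use today even without any new API. A minimal sketch in Python (the feed URL below is a placeholder):

```python
import urllib.request

def build_conditional_request(url, etag=None, last_modified=None):
    # The server can answer 304 Not Modified (an empty body)
    # unless the feed has actually changed since the last fetch.
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req
```

An aggregator would store the ETag and Last-Modified values returned with the previous fetch and replay them on the next sweep; this only saves the body transfer, though, not the per-entry granularity discussed below.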

When considering a feed, the following changes can happen:

- a new entry appears
- an existing entry is modified
- an entry is deleted
- an entry expires

The last two are distinct: a deleted entry no longer exists (its permalink no longer functions); an expired entry is no longer considered 'recent' (ie: will no longer exist in the main feed, because it's too old / there are too many more recent entries).

I would argue that signalling expired entries isn't necessary here, because entries drop off the main feed largely to stop the feed growing so huge that transferring it isn't feasible. There seems no point in wasting bandwidth to say "this is no longer on my front page" (effectively).
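Given the change types above, the aggregator-side computation is straightforward if each side can be reduced to a map of entry id to last-modified timestamp. A hypothetical sketch (this representation is an assumption for illustration, not part of any spec):

```python
def diff_feed(old, new):
    # old, new: dicts mapping entry id -> last-modified timestamp,
    # for the aggregator's stored copy and the freshly fetched feed.
    added = [eid for eid in new if eid not in old]
    changed = [eid for eid in new if eid in old and new[eid] != old[eid]]
    deleted = [eid for eid in old if eid not in new]
    return added, changed, deleted
```

Note that with only the main feed to compare against, "deleted" and "expired" look identical to the client, which is exactly why the distinction above matters for any delta-delivery API.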

Aggregator <-> SuperProducer use cases

Efficient update of feed knowledge across multiple feeds

Similar to updating a single feed efficiently, if an aggregator is fetching several feeds from one site, it could combine the request for updates into one. Note that in the current EchoExample, there isn't a way of knowing what this feed is, so if they are all available for query via a single SuperAggregatorApi URL, there'll need to be a standard way of specifying them. (eg: [primary] URL of the main feed?)
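One possible encoding, purely as a strawman, identifies each feed by the [primary] URL of its main feed in a repeated form parameter (the parameter name `feed` is an assumption, not anything specified):

```python
from urllib.parse import urlencode

def multi_feed_query(feed_urls):
    # Repeat the (hypothetical) 'feed' parameter once per requested feed,
    # so a single request can cover every feed on the site.
    return urlencode([("feed", u) for u in feed_urls])
```

The resulting query string could be sent by GET or POST to the single SuperAggregatorApi URL.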

Efficient transfer of feed request

If we're asking for many feeds in one go, it would be nice not to have to explicitly specify the feeds each time.

Aggregator <-> SuperAggregator use cases

I think these are largely going to be the same as Aggregator <-> SuperProducer use cases. There may be some specific ones, which can go here ...

Use Authentication to provide additional feed data

Support for feed retrieval APIs by aggregators/news readers. A number of mainstream news organizations require registration (or subscription in some cases). These organizations would be more open to public feeds if the feed retrieval passed them the registration data that they apparently want to track.

A tag in the header of a feed could indicate that authentication is required, with an API for retrieving the feed with this additional data. One can envision pointing an aggregator at a site to subscribe to it and having a login dialog pop up: 'This site requires a user name and password.' The user enters them, and the aggregator stores this information with the subscription. Subsequent feed refreshes use the API to retrieve the feed rather than the unauthenticated HTTP request.

This would open up channels like the Wall Street Journal to providing feeds. It also provides the basis for a myYahoo feed.

[Refactored from a suggestion by JoshJacobs.]

Re-use of existing authentication mechanisms

For requesting the main feed document, it may be possible to use standard HTTP authentication mechanisms; certainly aggregators should implement these (as noted in AggregatorBehaviorRules) even if other solutions are preferable. Within the AggregatorApi, we can use the authentication mechanism of the AtomApi we build on. This should save us from having to invent new mechanisms, although we should ensure that any specification is clear on these matters, and that there are examples of such use.
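As one example of re-using a standard mechanism, here is a minimal sketch of attaching HTTP Basic credentials to a feed request (Basic sends credentials essentially in the clear, so it's only appropriate over a secure channel; the URL is a placeholder):

```python
import base64
import urllib.request

def add_basic_auth(req, username, password):
    # Standard HTTP Basic authentication: base64 of "username:password"
    # in the Authorization header, per the usual HTTP auth scheme.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req
```

The aggregator would build this header from the credentials it stored with the subscription, as described in the use case above.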

AggregatorApi Operations

We need only a few operations:

Discussion about what is returned

[KenMacLeod] I'm pretty much in favor of an API approach to querying (GET/POST with url-form parameters for subtyping the query). It should be noted, however, that some weblog software only produces static sites (FTP-hosted websites, for example), or there are cases where dynamic queries would be a burden and they'd prefer to snapshot certain types of queries (a la "a syndication feed"). In those cases, some sort of static fallback should be defined.

[JeremyGray RefactorOk] Joe's current spec includes an XML representation of search results which currently appears to return a list of entry identifiers instead of full entries. There's probably room (and situations) for either type of returned XML, so perhaps it might be best to try to use the same search mechanism but with added controls that select the type of returned XML. Any thoughts?
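A possible shape for such a control, purely illustrative (both parameter names below are assumptions, not anything in the current spec): the same search query carries a parameter selecting whether identifiers or full entries come back.

```python
from urllib.parse import urlencode

def search_query(terms, result_type="ids"):
    # result_type: "ids" for a list of entry identifiers,
    # "full" for complete entries. Both the 'q' and 'result-type'
    # parameter names are hypothetical.
    return urlencode([("q", terms), ("result-type", result_type)])
```

This keeps a single search mechanism while letting the client trade bandwidth against round trips.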

[DavidJanes RefactorOk] Check out the multiple feeds section of EchoFeed. I think this is what we're looking for in terms of a result string. I'm hoping to throw together a strawman as soon as I can get some spare time. I was thinking conceptually of three levels of service: level 0 -- flat files; level 1 -- incremental delivery, but basically flat files efficiently delivered; level 2 -- "something more complex", being able to do things like deliver an updated comment or metadata within an entry without resending the entire entry. You'll probably want a better explanation than this, but I'm rushing for work :)



The basic operations for the SuperAggregatorApi are

Subscription allows the client to select which feeds it is interested in, so the full list does not need to be sent every time. This could be a "real" operation or it could be some sort of "user preference" (debate?)

[MartinAtkins : RefactorOk] How does a client know whether it's dealing with a simple feed or an aggregate feed provider? Providing a UI in an aggregator to send subscription requests to a simple feed would be counter-intuitive to users, who shouldn't really have to know a great deal about what's going on under the hood.


Discussion about impact of multiple feeds

[JamesAylett RefactorOk] Probably allow different timestamps for different feeds, eg: <feed id='...' last-updated='...'/><feed id='...' last-updated='...'/>. My rationale here is that if I'm doing an hourly (say) sweep in my aggregator, and I updated one feed five minutes ago but there's another feed on the same URI I'm going to update anyway, I'd probably want to bundle it in just in case (where I wouldn't have bothered fetching the main feed again). This could happen with manual updates of feeds. Probably won't be needed for a SuperAggregator.
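A sketch of building that request fragment from the aggregator's stored state (the element and attribute names come from the suggestion above and are not a finalized format):

```python
def feed_update_request(feeds):
    # feeds: list of (feed_id, last_updated) pairs from the
    # aggregator's store; each becomes one <feed/> element so the
    # server can compute a per-feed delta.
    return "".join(
        f"<feed id='{fid}' last-updated='{ts}'/>" for fid, ts in feeds
    )
```

A real implementation would use a proper XML serializer to handle escaping; the string version is just to show the shape.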

Find me a home

[MartinAtkins : RefactorOk] I consider it good that people are starting to think higher-level than HTTP. HTTP works well for atomic entities, but something more fine-grained would definitely be a boon for Atom syndication, which has "entries", a smaller unit than the "feed". Part of this is realising that HTTP proxies aren't going to do as well as something more Atom-specific which has knowledge of the concepts of Atom and can cache at the entry level. See AggregateFeeds for (hopefully at some point) discussion on an aggregation/proxying layer for Atom.

[JamesAylett DavidJanes DeleteOk]

[FrançoisGranger DeleteOk] I think some of you probably already read this

CategoryArchitecture CategoryApi