Introduction
This page is spun off from AggregatorBehaviorRules. Moot or archived discussions can go in AggregatorApiArchived.
Scope
The purpose of AggregatorApi is to provide more efficient delivery of Necho/Atom data to clients. In particular, only changes will be delivered over the channel.
Functionally, the end result should be at least as powerful as polling "flat" XML files.
Getting up to speed with this page/where we are
This section will contain a summary of issues that have been resolved further down the page.
Things that we have agreed upon so far
- This will be an extension to the AtomApi, so wherever it leads we will follow (i.e. we're not worried about REST vs. SOAP)
- We'll assume that polling is the default for now. If a Push or Notification standard comes around, we can piggyback.
Things that we have mostly agreed upon
- We'll be delivering XML documents (as chunks of text/strings) that very much look like/are Necho feeds
- The SuperAggregatorApi will extend this API to deliver many feeds, possibly using some sort of subscription model. cf. SuperAggregator. A simple base API will hasten adoption by lowering the barrier to entry. Discussion of the SuperAggregatorApi stays on this page, unless it gets too unwieldy.
Things that we are working on
- use cases
Goals / Design Points
- There is an opportunity here to advance the way aggregation works (in the same fashion that we are trying to advance the way publishing works)
- We're trying to maximise aggregator utility while minimising bandwidth usage
- We observe that the main feed is never going to be the best vehicle for this
- We conclude that leveraging the search/fetch part of the AtomApi is the best route (lest someone just shout 'too complex')
- Use cases include SuperAggregator, NewsMonster-like personal aggregators, [http://www.methodize.org/quicksub/ quickSub]... ?
- Given we're now not talking about a straightforward HTTP request, we can't use HTTP expiry, so we should note prior art in RSS's <ttl> and similar elements
- Since we want to conserve bandwidth, we aim to transfer only new or changed items
- We would like a way of saying "this item has been deleted", although it may not be honoured by the aggregator
- A memento/cookie system? No; see the argument in MementoVsTimestamp. However, the server could tell the client the "completeness" of the request, eg saying "most recent 17 entries of total 317 in feed", or something? I'm not convinced that's actually helpful, given that I don't think you're going to be able to divine entry-ids or permalinks just by knowing a couple of them. Perhaps that, or just warning='entry-list-truncated'. Needs more thought.
- We want easy discovery of the API. Probably via the main Atom introspection document.
- We need the ability to query for the identifiers of entries which are new (or have changed) between a given set of datetimes (see the sketch after this list)
- We need the ability to retrieve individual entries as needed
- We need the ability to retrieve the content of an entry when needed, and in the format in which it is needed
- We should give thought to issues of non-public (account-based) aggregate feed services. This will likely be covered by existing parts of the HTTP spec (HTTP auth, Cookies), but the mechanisms for their use in an aggregator are unclear.
- Note that MarkPilgrim's aggregator HTTP tests make reference to basic and digest authentication as optional features (and, indeed, to HTTPS).
- Note JoshJacobs' comment on ReaderAPI.
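To make the query/fetch requirements above concrete, here is a hypothetical exchange. The URL shapes, the parameter name (atom-modified-since) and the result element names are all invented for illustration; the real forms will come from wherever the AtomApi search interface lands.

    GET /atom/search?atom-modified-since=2003-09-01T00:00:00Z HTTP/1.1
    Host: example.org

    HTTP/1.1 200 OK
    Content-Type: application/xml

    <search-results>
      <entry-ref id="tag:example.org,2003:290" href="http://example.org/atom/entry/290"/>
      <entry-ref id="tag:example.org,2003:317" href="http://example.org/atom/entry/317"/>
    </search-results>

    GET /atom/entry/317 HTTP/1.1
    Host: example.org

The second request then pulls an individual entry, in whatever format content negotiation selects.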
Issues to raise with respect to editing
- [JeremyGray RefactorOk] true syndication scenarios - One needs to bear in mind that very little syndication goes on in the RSS world right now, if one gets down and dirty with the meaning of the word syndication. Publishing, on the other hand - well, plenty of that happens. SuperAggregators would likely be the best, most immediate example of what I had in mind.
These should bring to light other important things that are being swept under the rug at present, such as:
- universally unique entry identifiers, likely the URL that is the entry's API endpoint
- the main syntax seems to be moving towards having a permanent link to a representation of the entry; this, however, isn't necessarily the API endpoint (it might be the HTML page), and so should be raised.
Aggregator use cases
We need a list of scenarios that we're trying to encompass in order to be able to accurately define the API operations we'll need to complete this work. Throughout we will ignore comments, trackbacks and so forth, since work is ongoing elsewhere to combine them with entries in the ConceptualModel.
Aggregator <-> SimpleProducer use cases
Efficient update of feed knowledge
An aggregator has an existing copy of a feed, which is probably out of date. We want to update the feed in a bandwidth-efficient fashion, ideally without being too processor-intensive.
(Note: to my knowledge, no one is currently putting forward a concrete proposal for an expiry mechanism. RSS has some solutions, and HTTP has an expiry mechanism which can give hints but isn't really designed for consuming user agents. The requirement is noted in AggregatorBehaviorRules.)
When considering a feed, the following changes can happen:
- new entry
- modified entry
- deleted entry
- expired entry
The last two are distinct: a deleted entry no longer exists (its permalink no longer functions); an expired entry is no longer considered 'recent' (ie: will no longer exist in the main feed, because it's too old / there are too many more recent entries).
I would argue that expiring entries isn't necessary here, because entries disappear off the main feed largely to avoid the main feed getting so huge that transferring it isn't feasible. There seems no point in wasting bandwidth to say "this is no longer on my front page" (effectively).
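One way the "this item has been deleted" requirement could surface in a delta response is sketched below. The <deleted-entry> element is purely hypothetical (nothing like it exists in the current drafts), and as noted above an aggregator would be free to ignore it:

    <feed>
      <entry>
        ...a new or modified entry, delivered in full...
      </entry>
      <deleted-entry id="tag:example.org,2003:290"/>
    </feed>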
Aggregator <-> SuperProducer use cases
Efficient update of feed knowledge across multiple feeds
Similar to updating a single feed efficiently: if an aggregator is fetching several feeds from one site, it could combine the requests for updates into one. Note that in the current EchoExample there isn't a way of knowing what feed this is, so if they are all available for query via a single SuperAggregatorApi URL, there'll need to be a standard way of specifying them. (eg: [primary] URL of the main feed?)
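A hypothetical multi-feed form of changed-since, identifying each feed by its primary feed URL as suggested above (the endpoint, parameter names and response elements are all invented; the request body is wrapped here for readability):

    POST /atom/multi-changes HTTP/1.1
    Host: example.org
    Content-Type: application/x-www-form-urlencoded

    feed=http://example.org/alice/index.xml
    &feed=http://example.org/bob/index.xml
    &atom-modified-since=2003-09-01T00:00:00Z

    HTTP/1.1 200 OK
    Content-Type: application/xml

    <feeds>
      <feed url="http://example.org/alice/index.xml">...changed entries...</feed>
      <feed url="http://example.org/bob/index.xml">...changed entries...</feed>
    </feeds>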
Efficient transfer of feed request
If we're asking for many feeds in one go, it would be nice not to have to explicitly specify the feeds each time.
Aggregator <-> SuperAggregator use cases
I think these are largely going to be the same as Aggregator <-> SuperProducer use cases. There may be some specific ones, which can go here ...
Use Authentication to provide additional feed data
Support for feed retrieval APIs by aggregators/news readers. A number of mainstream news organizations require registration (or subscription, in some cases). These organizations would be more open to public feeds if the feed retrieval passed them the registration data that they apparently want to track.
A tag in the header of a feed could indicate that authentication is required, with an API for retrieving the feed along with this additional data. Picture pointing an aggregator at a site to subscribe to it and having a login dialog pop up: 'This site requires a user name and password'. The user enters these, and the aggregator stores the information with the subscription. Subsequent feed refreshes use the API to retrieve the feed rather than the non-authenticated HTTP request.
This would open up channels like the Wall Street Journal to providing feeds. It also provides the basis for a myYahoo feed.
[Refactored from a suggestion by JoshJacobs.]
Re-use of existing authentication mechanisms
For requesting the main feed document, it may be possible to use standard HTTP authentication mechanisms; certainly aggregators should implement these (as noted in AggregatorBehaviorRules) even if other solutions are preferable. Within the AggregatorApi, the authentication mechanism of the AtomApi we build on can be used. Hopefully this will save us from inventing new mechanisms, although we should ensure that any specification is clear on these matters, and that there are examples of such use.
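For reference, the standard HTTP Basic exchange this implies (Digest would be similar; the credentials shown are the literal string user:password, base64-encoded):

    GET /index.xml HTTP/1.1
    Host: news.example.com

    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: Basic realm="Subscribers"

    GET /index.xml HTTP/1.1
    Host: news.example.com
    Authorization: Basic dXNlcjpwYXNzd29yZA==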
AggregatorApi Operations
We need only a few operations:
- retrieve (everything, within reason)
- retrieve-changed-since-timestamp
- [MartinAtkins : RefactorOk] What about 'get one entry with this ID'? Combined with entry-level ETags as proposed by someone else below, we could then ask questions like 'get this entry if-none-match what I already have', which would certainly be useful in AggregateFeeds -- see the discussion about resolving non-authoritative to authoritative. Simple aggregators would probably like to be able to ask for updates on a particular entry sometimes; maybe the aggregator has just been given, via an aggregate feed, a non-authoritative new version of an entry it already holds an older, authoritative copy of. The aggregator will not want to replace the authoritative version it has with a non-authoritative copy, but it can go back to the originating server (assuming it has that kind of info stored) and check whether the authoritative version matches the non-authoritative one.
- [JamesAylett RefactorOk] If you have the ID, why would you not just be able to fetch the entry directly (and use If-None-Match or If-Modified-Since at the HTTP level)? I'm pretty sure this is already supported by Joe's API. If not, having a "resolve ID -> URL" search point might be better than adding in part of HTTP's caching model, surely? It just strikes me as something that's quite fiddly to get right, and hence a burden on the producer - which is the last thing we want to add here, as it will quench adoption.
- [JamesAylett RefactorOk] I think we're agreed that querying for all entries is useful. I think we're also agreed on having two variants of search: one which returns references only (presumably ID + URL, since one can't be guessed from the other), and one which returns inline entries to save another call. That, combined with the spirit of Joe's draft (which has yet to fill out details of the search API), gives us most of what we want except for the changed-since-date functionality, doesn't it? Certainly the idea is to leverage all of the RestEchoApi search facilities.
I can't see your extended search proposal easily solving the modified-since-timestamp problem, unless we transfer dates as UNIX timestamps rather than textual dates (or unless you mandate not only an XPath implementation but also some of the EXSLT modules). I'm also not terribly happy with the idea of telling producers that they need to support XPath in order for aggregators to be able to talk to them efficiently.
[AsbjornUlsberg] Why is there a common perception that dates other than UNIX timestamps can't be sorted and treated as numbers/strings? ISO-8601, or W3CDTF (which is the date format Echo should use), can be sorted as a string, hence it will be very easy to express "I want all entries since 2003-09-01". It will also be easy to convert any type of syntax for this expression into XQuery, XPath or SQL, depending on what input syntax we choose and what data engine the aggregator is built on.
[JamesAylett] Good point, I wasn't thinking. Scratch the complaint about modified-since-timestamp being difficult to implement. Joe's theory about collapsing the editing facet of the AtomApi so that it doesn't need search is interesting, because it means that a trivial GET interface is no longer needed - so having variants as suggested in RestEchoSearchApi, with XPath as one of a number of options, looks somewhat feasible; it should be possible to have the AggregatorApi as one, or a choice of more than one, of these variants. (To get there we probably need to a. refactor this page to DocumentMode and push to agree on our requirements, and b. ensure these requirements are expressed and addressed in at least one variant.)
[AsbjornUlsberg] I haven't read Joe's suggestion yet, but it sounds neat. I will read it thoroughly on the subway on my way home from work. Dropping the GET interface sounds great, but won't it demand making a lot of requests to different entry URIs? As I haven't read Joe's proposal yet, you don't need to answer me before I have.
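A quick illustration of AsbjornUlsberg's point that W3CDTF timestamps sort correctly as plain strings, so "changed since" needs nothing more exotic than string comparison (Python used purely for demonstration; note this only holds when all stamps share the same timezone designator and precision):

    # W3CDTF/ISO-8601 timestamps in UTC compare correctly as strings:
    # lexicographic order == chronological order.
    stamps = ["2003-09-10T08:00:00Z", "2003-08-30T23:59:59Z", "2003-09-01T00:00:00Z"]
    cutoff = "2003-09-01T00:00:00Z"

    # "All entries since 2003-09-01" is a plain string comparison.
    recent = [s for s in stamps if s >= cutoff]

    assert sorted(stamps) == [
        "2003-08-30T23:59:59Z",
        "2003-09-01T00:00:00Z",
        "2003-09-10T08:00:00Z",
    ]
    print(recent)  # ['2003-09-10T08:00:00Z', '2003-09-01T00:00:00Z']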
Discussion about what is returned
[KenMacLeod] I'm pretty much in favor of an API approach to querying (GET/POST with url-form parameters for subtyping the query). It should be noted, however, that some weblog software only produces static sites (FTP-hosted websites, for example), or dynamic queries would be a burden and they'd prefer to snapshot certain types of queries (a la "a syndication feed"). In those cases, some sort of static fallback should be defined.
- [JeremyGray RefactorOk] We're not proposing to do away with the standard feed mechanism (a la current RSS). The idea of the AggregatorApi is to define and leverage a more granular approach for aggregators and servers capable of utilizing such a process.
- [DavidJanes RefactorOk] In fact, one of many reasons I think this part of the API should return a Necho feed XML document is to leverage the basic code that reads the flat files.
- [JeremyGray RefactorOk] That would definitely make good sense.
- [AsbjornUlsberg] For a static fallback, XPath can be used on "all available entries". If we implement my "extended search" proposal, this would be fairly easy, provided that an XPath engine exists on the given platform/environment/language.
[JeremyGray RefactorOk] Joe's current spec includes an XML representation of search results which currently appears to return a list of entry identifiers instead of full entries. There's probably room (and situations) for either type of returned XML, so perhaps it might be best to try to use the same search mechanism but with added controls that select the type of returned XML. Any thoughts?
- [DavidJanes RefactorOk] I think we're all agreed that returning references is moderately non-useful. The AggregatorApi and the SuperAggregatorApi must return the content, as our goal is efficiency, right?
- [JeremyGray RefactorOk] References are definitely handy in situations where the user may or may not choose to view a particular entry based on its title. I, for instance, read relatively few of the entries that show up in my Daypop feed, whereas I read each and every new entry from a number of other feeds. I can see it going either way, depending on the feed and the end user, and I guess it comes down to this: if "our goal is efficiency", how do we define "efficiency"? From a technical perspective, presumably bandwidth. From an end-user perspective, perhaps responsiveness to actions they take in their aggregator user interface. Both are likely important, and each justifies one of the options available.
- [JamesAylett RefactorOk] I agree that we want both options. It wouldn't be difficult to extend Joe's draft to allow full entries to be returned based on a switch to the search call.
[DavidJanes RefactorOk] Check out the multiple feeds section of EchoFeed. I think this is what we're looking for in terms of a result string. I'm hoping to throw together a strawman as soon as I can get some spare time. I was thinking conceptually of three levels of service: level 0 -- flat files; level 1 -- incremental delivery, but basically flat files efficiently delivered; level 2 -- "something more complex" -- being able to do things like deliver an updated comment or piece of metadata within an entry without resending the entire entry. You'll probably want a better explanation than this, but I'm rushing for work.
- [JeremyGray RefactorOk] Would I be correct to assume that you are pointing to the multiple feeds section of EchoFeed from more of a SuperAggregator perspective? If so, that approach looks reasonable to me, though there are surely details within the examples that could use some discussion (i.e. entry and feed identifiers, the "location"/"also-at" stuff, etc.)
SuperAggregatorApi
operations
The basic operations for the SuperAggregatorApi are:
- subscribe
- list-subscriptions
- multi-feed variants of the main AggregatorApi operations
Subscription allows the client to select which feeds it is interested in, so the list does not need to be sent every time. This could be a "real" operation or it could be some sort of "user preference" (debate?)
- [JamesAylett RefactorOk] What does this mean? I can't see it working except as a real operation - you define a subscription, which gives you an opaque token you use in the query operations instead of listing all your feeds explicitly.
[DavidJanes RefactorOk] Well, the AtomApi does define user preferences, which means either a login or a token-based system. By a distinct "subscribe" operation, I mean exactly that: an operation returned by the discovery API. Otherwise, I assume the subscription list would be maintained via the "user preference" operation.
[MartinAtkins : RefactorOk] When I first started thinking about AggregateFeeds, I didn't really imagine the subscription part being part of the API. In many cases this is either non-configurable (as with Moreover) or configurable via an alternative means (LiveJournal's "friends list"). Any API for managing subscriptions should not be required to be supported by all super aggregators.
[DavidJanes RefactorOk] BlogMatrix could potentially be super-aggregating hundreds of thousands of blogs! Note that I'm not saying subscribing is the only way to do that, but I certainly see it as an important operation!
- [JamesAylett RefactorOk] Apologies for being slow, but I thought the 'subscription' idea had the purpose of reducing bandwidth when querying many feeds? ie: in the most common case it's a way of reducing the bulk of a retrieve-everything-since-timestamp request over many feeds, over many calls to that API method, by avoiding having to transfer the feed list every time? There's nothing in that which requires any user information, nothing that requires that we bind into user preferences. I'd suggest we keep these separate for the moment. We should aim to achieve our goals without getting carried away with ourselves (to an extent this suggests we shouldn't spend too much time on the SuperAggregatorApi right now, but I recognise the interest in it both from the SuperAggregator and the SuperProducer points of view, and the latter in particular is pretty important).
If a SuperProducer wants to offer user preferences including a feed subscription service, it can do so by extending the user config (5.6 in Joe's draft), perhaps having an element that gives the opaque token equivalent to getting it via the subscription method as I originally understood it. The user pref stuff in the current AtomApi is pretty simplistic at the moment, but even so it would be very easy to extend.
[DavidJanes RefactorOk] I'm not disagreeing with what you're saying, but isn't the person/software talking to the api a user? I'd rather have a separate subscribe option for clarity, but on the other hand it seems to have a lot of duplication with the idea of a user preference: i.e. something has to be maintained server side and the information is unique and customizable to a single client.
- [JamesAylett RefactorOk] I don't see why anything has to be maintained server side. If I'm reading ten (say) LiveJournal feeds, then my aggregator will currently fetch ten RSS files from LiveJournal, once an hour (or whenever). Next we collapse that into asking for changes to those ten feeds once an hour. Then I start reading another ninety, which is where we need subscription, because I don't want to spend bandwidth asking for those 100 feeds every hour (and LiveJournal probably agrees). So I want a quick way of asking for them.
An opaque token that LiveJournal can decode to figure out the feeds to consider doesn't require that I have a LiveJournal account; it just requires that when my reading list of LiveJournals changes, my aggregator contacts LiveJournal once with the new list of feeds, and updates its opaque token.
Yes, you can wind this into user prefs as well, but I'd argue that (a) user prefs in the current spec aren't terribly well progressed yet; and (b) doing opaque tokens first benefits us more. It's simpler to implement, so it should yield more adoption amongst superproducing sites and (crucially) amongst aggregators. We can leverage aspects of the mechanism (even if the opaque token disappears) for a future user pref system. Just because the software talking to the api can be considered a user doesn't argue that it should be. If we design a system that leans on the userpref system, we require anyone who wants to take advantage of it to create an account. Let's encourage use of this idea, which should save bandwidth for everyone, without inconveniencing them by making server-side storage of the subscription a requirement.
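A sketch of that flow, using the LiveJournal example (the endpoint names, token syntax and everything else here are invented; the token is opaque to the client, and the first request body is wrapped for readability):

    POST /atom/make-subscription-token HTTP/1.1
    Host: www.livejournal.example
    Content-Type: application/x-www-form-urlencoded

    feed=http://www.livejournal.example/users/alice/atom
    &feed=http://www.livejournal.example/users/bob/atom
    &... (98 more)

    HTTP/1.1 200 OK
    Content-Type: text/plain

    token=0c9a41d7

    GET /atom/changes?token=0c9a41d7&atom-modified-since=2003-09-10T08:00:00Z HTTP/1.1
    Host: www.livejournal.example

When the reading list changes, the aggregator repeats the first call and stores the new token. No account is required; whether the server stores the list or encodes it into the token itself is its own business.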
- [MartinAtkins : RefactorOk] Your 'opaque token' sounds like a Cookie to me. We already have specs for cookies, so if we just make sure all aggregators support them properly we can be set. The difficult part is how the cookie gets into the aggregator in the first place... but maybe that's where the subscribe event comes in.
I don't like the idea that subscriptions are considered a client-side item, yet stored on the server. If I decide one day that I want to use a different client to view my feeds I don't want to have to recreate my subscription list just because the opaque session cookie is tied up in the depths of my old application. I guess this is where a user account would come in.
I don't really see any harm in supporting both approaches, especially since they are both already part of HTTP.
- [JamesAylett RefactorOk] My immediate reaction was negative, but on reflection using HTTP cookies sounds quite neat. You'd get them out of the SuperProducer by the subscription API call (which would be more like create-subscription-cookie). My main concern would be whether there's an appropriate HTTP verb to use here for REST, and whether we can find a good HTTP response code. POST and 204 seem favourite, although using POST for this seems a little strange. There isn't much choice, though.
Re: persistence between clients, can't we put the opaque token into the OPML file you export to get the subscription list across anyway? (Although having said that, since the client needs to know a little information about the feed anyway, and it can't get that from the token, you probably need the list of feeds in the OPML file anyway - in which case adding the token/cookie isn't much help.)
The other side of things - server-based storage - I'd see as using the same mechanism as create-subscription-cookie, but with authentication, and a different API call (perhaps just the edit userpref stuff in the main API?).
- [GrahamParks] How about item-level eTags? The aggregator remembers the eTag (likely a hash) for each item it thinks is current, and submits them (HTTP POST?) when it makes a request. The server then sends back i) whole items for eTags not mentioned by the aggregator, ii) just a placeholder (eg <olditem>[eTag]</olditem>) for items that were mentioned. This seems much more fail-safe than relying on dates. You'd need to do this for the feed-level attributes as well. Also, the format for the request probably couldn't be XML for this to save bandwidth.
- [JamesAylett RefactorOk] What's the difference here between an eTag and an ID? From your description, I can't tell. I'd guess the intent is to ask for "entry ID if-none-match ETag ..." (cf: HTTP). I don't know quite why this makes me feel uneasy; probably because I see it as a fair amount of work reimplementing something from HTTP that I feel sure we should be able to leverage (can you not send an ETag with a response to a GET request that had query parameters?). Also, in the commonest situation ("give me everything changed since FOO"), we end up transferring more for no extra utility whatsoever. Even avoiding XML for the request, won't you end up with a huge list of ID/ETag pairs when we could just say "search?atom-modified-since=DATESTRING"?
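For comparison, the plain-HTTP alternative raised earlier ("fetch the entry directly and let HTTP's caching model do the work") would look like this; the entry URL and ETag value are invented:

    GET /atom/entry/317 HTTP/1.1
    Host: example.org
    If-None-Match: "a1b2c3"

    HTTP/1.1 304 Not Modified

If the entry has changed, the server instead returns 200 with the new representation and a fresh ETag.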
[MartinAtkins : RefactorOk] How does a client know whether it's dealing with a simple feed or an aggregate feed provider? Providing a UI in an aggregator to send subscription requests to a simple feed would be counter-intuitive to users, who shouldn't really have to know a great deal about what's going on under the hood.
- [JamesAylett DeleteOk] I'm somewhat assuming that we'll get the basic AggregatorApi done first, and worry about SuperAggregator / SuperProducer support (including how to identify such, and also whether a producer is canonical for an entry/feed) later. FWIW, I'd imagine identifying a SuperAggregatorApi instance via the introspection file, and whether a producer is canonical for an entry/feed by an attribute or element extension to the core Atom XML.
requirements
- the API should be able to handle many feeds
- possible use models for this:
  - a client program/Aggregator connects to a SuperAggregator to get updates for many blogs, thus avoiding having to poll tens of websites for updates,
  - (much less used) a case where two SuperAggregators are communicating -- sort of a USENET news server model.
Discussion about impact of multiple feeds
[JamesAylett RefactorOk] Probably allow different timestamps for different feeds, eg: <feed id='...' last-updated='...'/><feed id='...' last-updated='...'/>. My rationale here is that if I'm doing an hourly (say) sweep in my aggregator, and I updated one feed five minutes ago but there's another feed on the same URI that I'm going to update anyway, I'd probably want to bundle it in just in case (whereas I wouldn't have bothered fetching the main feed again). This could happen with manual updates of feeds. Probably won't be needed for a SuperAggregator. (A fuller version of this request is sketched below.)
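Expanding that fragment into a complete hypothetical request (the endpoint and wrapper element are invented):

    POST /atom/changes HTTP/1.1
    Host: example.org
    Content-Type: application/xml

    <changes-request>
      <feed id="http://example.org/alice/index.xml" last-updated="2003-09-10T07:55:00Z"/>
      <feed id="http://example.org/bob/index.xml" last-updated="2003-09-10T07:00:00Z"/>
    </changes-request>

The response would carry, per feed, only entries changed since that feed's own timestamp.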
Find me a home
[MartinAtkins : RefactorOk] I consider it good that people are starting to think higher-level than HTTP. HTTP works well for atomic entities, but something more fine-grained would definitely be a boon for Atom syndication, which has "entries", a smaller item than the "feed". Part of this is to realise that HTTP proxies aren't going to do as well as something more Atom-specific which has knowledge of the concepts of Atom and can cache at the entry level. See AggregateFeeds for (hopefully, at some point) discussion on an aggregation/proxying layer for Atom.
[JamesAylett DavidJanes DeleteOk]
- (when we get there) in discussions of a mooted SuperAggregatorApi, define (and hopefully get into the conceptual model for the entire project) SimpleAggregator, SuperAggregator, SimpleProducer and SuperProducer. This will need to mention AggregateFeeds as well (they're a combination of SuperAggregator and either SimpleProducer or SuperProducer, from what I understand).
- [JamesAylett DeleteOk] I'd like to start getting some basic definitions of these down, initially on this page, because I really think they'll help our discussions. (Or at least make the arguments less verbose.)
[FrançoisGranger DeleteOk] I think some of you have probably already read this: http://www.jerf.org/irights/2003/09/10.html#a2331
- [JamesAylett] This seems to have two suggestions:
  - Use rsync (in AggregatorBehaviorRules we've already noted the wisdom of RFC 3229, which does content deltas over HTTP - basically doing the same job as rsync; see the sketch below)
  - Have a feed give directions to an aggregator on obtaining mirror copies, which could in some way provide load balancing. I'm not convinced by this - of the two approaches which spring to mind (both of which are used today for improving HTTP performance), round robin isn't effective load balancing, and ordering by shortest network path is difficult to calculate (and may still not be effective load balancing); and even if they are, there are existing ways of doing this that will probably work as well (DNS round robin; content distribution networks). However, a modification of the detail might be worth considering: if feed A knows that SuperAggregator B takes its feed (this could be determined automatically by software, even), it can advertise this in the feed metadata; then aggregator C could choose to use B to grab this feed (especially if it already took other feeds from B).
  - Also, although the original article is concerned only with large-scale sites (which should be able to do one of the more architectural solutions), there may actually be something there for smaller sites.
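For reference, a rough RFC 3229 exchange (the ETag values are invented; 'diffe' is the ed-script instance-manipulation the RFC defines, and this is only a sketch of my reading of the RFC):

    GET /index.xml HTTP/1.1
    Host: example.org
    A-IM: diffe
    If-None-Match: "v41"

    HTTP/1.1 226 IM Used
    ETag: "v42"
    IM: diffe
    Delta-Base: "v41"

    (body: an ed-style diff transforming the cached "v41" instance into "v42")

The client patches its cached copy rather than re-downloading the whole feed - the same saving rsync aims for, but over plain HTTP.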