Mark Pilgrim's Identifying Atom article indirectly makes three
assertions about what would be ideal in a syndication protocol with
respect to ids, which I will paraphrase thus:
1. IDs are mandatory
2. the semantics on how/when IDs are to be generated and when they
should be copied need to be specified
3. the semantics on how IDs are to be compared need to be specified
One thing that is true of all current versions of RSS
and Atom is that #2 and #3 are underspecified.
No one should be surprised that feeds today are
syndicated. Or aggregated. Or that the results of
syndication and aggregation are themselves published.
However, given the current underspecification of ids, we often
find that a number of "planet" sites (e.g.,
Sun) are copying content and
summaries from feed entries, but are NOT preserving identity.
The inevitable result: people who subscribe to these feeds will
see duplicate entries. And aggregator authors will get the blame.
The solution: spec text that conveys the requirement that ids
must be preserved if an entry is relocated, migrated, syndicated,
republished, exported or imported.
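The rule above can be sketched in code. Here is a minimal illustration in Python; the feed snippet, the `republish` function, and the use of the Atom 1.0 namespace are all invented for the example, not taken from any real planet code:

```python
# Sketch: when republishing an Atom entry, carry the source <id>
# over verbatim instead of minting a new one. All names and the
# sample entry below are hypothetical.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

source_entry = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<id>tag:example.org,2004:entry-1</id>'
    '<title>Original post</title>'
    '</entry>'
)

def republish(entry):
    """Build an aggregated copy of an entry that preserves identity."""
    copy = ET.Element(ATOM + "entry")
    # Copy the id character for character -- never regenerate it.
    eid = ET.SubElement(copy, ATOM + "id")
    eid.text = entry.find(ATOM + "id").text
    title = ET.SubElement(copy, ATOM + "title")
    title.text = entry.find(ATOM + "title").text
    return copy

copy = republish(source_entry)
print(copy.find(ATOM + "id").text)   # tag:example.org,2004:entry-1
```

A subscriber comparing this copy against the original entry will see one entry, not two, because the id survived the round trip untouched.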
Note: none of this needs to wait on Atom becoming final.
People who build sites that aggregate content from various sites
should consider preserving ids if they are present in the source
feed. Perhaps the
Advisory Board should consider making a similar recommendation.
The topic of comparison is secondary to all this but
important. Lacking any other guidance in specifications,
producers need to be aware that consumers will be free to perform
any of the comparison methods defined in
RFC 2396bis in order to lower the risk of false negatives.
If all programmers were
perfect, specifying a character by character comparison would be
sufficient. Unfortunately, history has shown that not
everybody reads specs carefully, and a number of
existing libraries are a wee bit too helpful. This would
make such a requirement a bit fragile. So it might be
worthwhile to consider a design choice that makes things more
resilient in the face of such deviations.
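As a concrete illustration of that fragility, Python's own standard library is mildly "helpful" in exactly this way: merely parsing and re-serializing a URI lowercases its scheme, which is enough to break a byte-for-byte comparison. The id below is invented for the example:

```python
# Round-tripping an id through a URI library can change its bytes,
# so a later character-by-character comparison fails.
from urllib.parse import urlsplit

original = "HTTP://example.org/2004/entry-1"
round_tripped = urlsplit(original).geturl()

print(round_tripped)              # http://example.org/2004/entry-1
print(original == round_tripped)  # False -- the id no longer matches
```

No one wrote normalization code here; the library did it as a side effect of parsing, which is the "wee bit too helpful" behavior in question.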
The initial read on consensus was that requiring canonical ids was
effective without being overly burdensome. A later read on
consensus softened the requirement for canonicalization to a
recommendation. Even this is provisional; it could change again.
In any case, what does this mean? To most people, nothing. The
RFC 2396bis folks were pretty smart and picked a set of rules
that pretty much everybody on the planet is following
anyway. But if you do happen to pick an id that is not
canonical, your feed will be fine. The only problem that is
likely to occur is if one of those planet sites uses a library
which is too helpful. For that
reason, the Feed
Validator will be updated to provide a warning - just a
warning, not an error - if an id is not canonical. This
warning will be linked to a help page which will indicate that if
you are copying an id which is not canonical, you are doing the
right thing by preserving the id from the source feed character by
character. It is only if you are generating new ids that you
should be concerned about canonicalization.
Again, this is not likely to affect very many people.
And just to repeat for emphasis: if you are syndicating content,
please preserve the identity of the entries. When comparing
ids, please do it by comparing character by character. When copying
ids, please do it by copying character by character.
Only when you are generating new ids
(just ids, not links, or html) should you consider
normalization. If you don't normalize, and
everybody follows the rules, things will still work.
But if you do normalize and somebody's library routine changes your
URI in any way, the Feed Validator will provide a warning on your feed.
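For anyone generating new ids, here is a rough sketch of what "canonical" means, following the syntax-based normalization rules of RFC 2396bis (published as RFC 3986): lowercase the scheme and host, and drop the default port. Percent-encoding and dot-segment normalization are omitted to keep the sketch short, and the function name is mine:

```python
# Minimal syntax-based normalization sketch for newly generated ids.
# Userinfo components are ignored in this simplified version.
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(uri):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep an explicit port only if it is not the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = "%s:%d" % (host, parts.port)
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))

print(canonicalize("HTTP://Example.ORG:80/entry"))  # http://example.org/entry
```

Run once, at generation time, and never again: consumers and republishers still compare and copy the resulting string character by character.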
I remember writing about parts of what you've written here on the Atom wiki. I think I was getting a bit ahead of where everyone else was, though, since everyone was still bickering about whether comments and trackbacks are entries at the time. I still think that feeds which aggregate other feeds are very important, and equally important is that clients can identify the relationship between the authoritative entry and the syndicated entry. This is more involved than just keeping the ids intact. Not all republishing feeds are or will be controllable; some will give you stuff from a fixed set of sources with no control whatsoever, and some might even provide only selected entries from hundreds of sources categorised by humans or software. The upshot of this is that I might end up subscribing to two feeds which chuck me the same entry.
This is alright if they are identical, but as soon as one is different I need to know which one is most authoritative so I can discard the others -- or rather have my software do so for me. The easy option is to somehow flag republished entries as non-authoritative, so authoritativeness is a boolean. Some might say it's valuable to have an amount of authoritativeness, but I'm not sure that's all that useful. As a side-feature, it'd be nice to be able to get a URI at which the authoritative version of an unauthoritative entry can be found, although that will of course be troublesome since entries have a habit of vanishing out of feeds after a while.
I can't actually remember where I wrote about all this in the wiki, since it was a long time ago. Still, I feel strongly about this distribution model as it's obvious that we need to move away from the model where thousands of clients all pull data from one source. The cascading aggregation model is a lot more like USENET's model, which was a good one.
As a last-ditch attempt to remain on-topic, I think I have to say that the only reliable way to compare URIs is by exact string matching. Sure, there'll be little oddities that spring up here and there, but if people don't cater to them I'd hope they'd be squashed pretty quickly, and if they do cater to them it's not a massive problem, as I would expect the incidence of someone publishing an ID of htttp://blah.invalid:80/ and a separate one of htttp://blah.invalid/ is very slim. The principle of being strict in what one produces and liberal in what one accepts seems to apply here. Specify the ideal, but always expect that people will screw it up and think about how much damage it'll do when they do. (not a great deal, in this case)
(I made up a fun new protocol because your comment mangler mangled my HTTP URIs)
I think there's a little discrepancy as far as RSS 1.0 is concerned - that spec says to use URIs (which would conflict with your char-by-char comparison), but since that spec's release RDF has moved to using URI References, and as the specs say: "Two RDF URI references are equal if and only if they compare as equal, character by character, as Unicode strings."
Danny, your link refers to the abstract syntax of RDF - separate and distinct from any concrete syntax (like RDF/XML). In fact there is even an example of the distinction between the two: the abstract syntax does not permit relative URI references, whereas the concrete syntax does.
Without trying it myself, I'm fairly confident that any .Net RDF/XML parser will conform to the abstract syntax by canonicalizing the concrete syntax.
Separate and different from the concrete syntax you say? Interesting considering that the concrete syntax spec links to that definition of URI reference when describing how rdf:ID works. See [link] and [link] for details.
On a quick inspection of the RSS feeds of Planet Gnome, Debian and Apache it looks like they are maintaining the <guid>/<link> elements of the aggregated items. Is there something more that they should be doing?
Dare, one would expect the definition of a concrete syntax to make a reference to the abstract syntax - but that does not make them identical. And the real question is whether it is clear to every implementer (not just Angels, but every implementer) that Uris are not meant to be normalized. Any RDF implementation which uses the System.Uri class in any way gets normalization "for free".
The concrete syntax spec makes a link to the spec defining the concepts behind [not just the abstract syntax of] RDF. It seems quite clear to me, and to everyone I've worked with in the other place where URIs are specced this way (XML namespaces), that you are supposed to compare URIs as they appear in the source document.
The only difference between Atom and RDF or XML namespaces is that a lot more average developers will be writing code that processes Atom than developers who've had to write XML or RDF parsers in the past. Such people probably won't read the spec whereas anybody implementing an XML or RDF parser probably will.
For those guys there might be edge cases where canonicalizing URIs bites them on the butt [although the only ones I can think of are contrived unless you involve relative URIs], but their questions are fairly easy to answer: use the string class, not the URI class, when processing Atom identifiers. It's what the folks implementing XML and RDF parsers have had to do as well. Atom developers shouldn't be any different, plus it adds consistency to the Web architecture.
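The string-class-versus-URI-class advice translates outside .NET as well. A hypothetical illustration in Python terms (the thread itself is about System.Uri, which is not shown here):

```python
# Compare ids as plain strings, not as parsed URI objects,
# which may silently normalize components.
from urllib.parse import urlsplit

id_a = "HTTP://example.org/entry"
id_b = "http://example.org/entry"

# String comparison: these are different ids.
print(id_a == id_b)                      # False

# Parsed comparison quietly treats them as equal, because the
# parser lowercased the scheme -- exactly the trap being described.
print(urlsplit(id_a) == urlsplit(id_b))  # True
```

The string comparison is the one the spec text calls for; the parsed comparison is the "attractive nuisance" that produces duplicates downstream.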
Dare: do you know of any .Net RDF parser? Do any of them make any use of the System.Uri class?
I don't know about you, but when such things happen, I would like to be able to do more than smugly point to the sentence in the spec that clearly spells out how horribly broken their software is. I'd like to make producers aware of the tradition of search engines and system URI libraries (both of which are very much vibrant parts of the web) to be slightly overzealous in their quest to eliminate false negatives.
Not with a mandate, a shall, or a MUST. But with a recommendation that this is something that they might want to be aware of.
Welcome to the world of standards development. As someone who's had to implement all sorts of unnatural behavior because that's what the specs say, or has told some customer their app is busted because of some brokenness in some W3C spec, I can feel where you are coming from.
However it seems you are optimizing for an edge case. Don't let edge cases dominate your design. It typically leads to unnecessary complexity and overengineering.
Dare: by your silence, the first thing I am going to assume is that you are OK with the requirement that IDs are mandatory. And with the requirement that IDs must be preserved - character by character - if an entry is relocated, migrated, syndicated, republished, exported or imported. And with the explicit requirement that IDs are to be compared on a character by character basis.
Your only quibble seems to be on a warning. To be produced by the Feed Validator. On what you openly admit is an "edge case". A warning that targets people who "probably won't read the spec". If it helps, I can promise to make sure that the help page directs people to use whatever string type they can find instead of whatever URI classes which might be available.
Scott has identified Yahoo! feeds as having this problem. I've also verified that the URI class in a popular platform is what I refer to as an "attractive nuisance". Finally, we are talking about a suggestion that requires absolutely no changes to RSSBandit.
It's just a warning. For a real problem. That requires no changes to RSSBandit to implement, but might reduce the number of duplicate entries that RSSBandit users would see with real feeds that exist today.
Sam: it might be that Planet Apache is using an old version of the planet code. I just downloaded the latest version, and added your atom feed as a test. The resulting generated rss20 feed included the following:
Good catch. I suppose it should be copying over the isPermalink value for rss2 feeds, and setting it to false for IDs found in atom feeds. It probably wouldn't be too difficult to do something like that.
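The fix described above could look roughly like this. Note this is a hypothetical sketch, not the actual planet code, and the function name is invented; only the element and attribute names (`guid`, `isPermaLink`) come from RSS 2.0 itself:

```python
# Sketch: when writing an aggregated RSS 2.0 item, preserve rss2
# <guid isPermaLink="..."> values as-is, and emit ids found in atom
# feeds as <guid isPermaLink="false">, since Atom ids are opaque
# identifiers, not necessarily resolvable links.
import xml.etree.ElementTree as ET

def make_guid(source_id, from_atom, is_permalink="true"):
    guid = ET.Element("guid")
    guid.set("isPermaLink", "false" if from_atom else is_permalink)
    guid.text = source_id          # preserved character by character
    return guid

g = make_guid("tag:example.org,2004:entry-1", from_atom=True)
print(ET.tostring(g, encoding="unicode"))
# <guid isPermaLink="false">tag:example.org,2004:entry-1</guid>
```

Either way the id text itself is copied verbatim; only the `isPermaLink` flag differs by source format.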
It's been proposed that Atom format mandates or recommends that publishers use canonical URIs. If this is accepted, then all the consumers have to do to get fairly accurate URI comparison is use string matching (i.e. char-by-char in the same...