It’s just data

Linkage

In RSS 0.91, link was a required sub-element of item.  In today's scripting news's RSS feed, link has been replaced by guid.  This SourceForge feed has guids that are not links.  In Archipelago's RSS feed the permaLink is encoded in the description, and the link is used for something else.

On my weblog, you can retrieve this blog entry by id, or by date.  In Joe Gregorio's weblog, you can retrieve blog entries by title.  Mark Pilgrim adds a date.

It is OK to have multiple "unique" ways to identify a resource.  When I have enough spare moments to think through the ramifications, I'll support titles in my URLs and probably make that the preferred external view.

Now let's standardize and simplify.  A well-formed blog entry will have two uniform resource identifiers (URIs) associated with it.  One, a permaLink, will identify the preferred external "name" for this blog entry.  The other, a postId, will identify the preferred internal "name".

Both can be http:// style URLs.  Neither is required to be a URL at all.  They can be URNs and simply identify things in a location-independent manner.  You can tell the difference by looking at the scheme (the portion of the URI before the ":").
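
A minimal sketch of that scheme check, assuming Python and its standard urllib.parse module.  The function name uri_kind and the sample identifiers are only illustrative, not part of any proposal here.

```python
from urllib.parse import urlsplit


def uri_kind(uri):
    """Classify an identifier by its scheme (the part before the first ':')."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "url"        # resolvable, location-dependent
    if scheme == "urn":
        return "urn"        # a name, location-independent
    if scheme == "":
        return "relative"   # no scheme at all, e.g. a relative reference
    return "other"


kinds = [
    uri_kind("http://www.example.com/2003/06/foo.html"),
    uri_kind("urn:example.com:45"),
    uri_kind("/2002/06/24/Linkage.html"),
]
```

A recipient never has to guess: the scheme alone says whether the identifier is meant to be fetched or only compared.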

I suspect that most permaLinks will be http:// style URLs.

The permaLink and the postId need not be different.  In fact, they will often be the same.  But let's not introduce complex precedence rules or require the recipient to guess for the sake of saving a few bytes of bandwidth in the cases when they are.

So... let's call a permaLink a permaLink and a postId a postId.  Require them both to be present.  And make them both URIs.


Oh please please please can we skip the camelCase this time around?  I can't tell you how much grief I've gotten on the validator feedback address about the differences between textinput and textInput, isPermalink and isPermaLink and ispermalink, and so forth.

I vote for all-lowercase element names, everywhere, no exceptions.

Posted by Mark at

Er I'm feeling stupid.  What is the postId for?  Can we have an example please?

Posted by Tim Bray at

Dare Obasanjo

Sam,
I can't think of any reason why both permaLink and postId are needed. A permaLink is a URI and URI is a Uniform Resource Identifier. Secondly, I agree with Mark about the annoyances caused by camelCasing (xmlUrl vs. xmlurl in OPML files).

Message from Dare Obasanjo at

I would have to agree with Mark on this one. One of the nicer things about XML is that it allows hyphens in tags; I think perma-link and post-id look better. permalink is fine as it is, too, since it is hardly a composite word, right?

Posted by Manuzhai at

The permaLink problem

Sam points out a few different current uses of permaLink and suggests: permaLink is required, and identifies a preferred external...... [more]

Trackback from cwilper-blog

at

Looking at some existing content management systems, I see some rather opaque URLs out there.  So, could a single URI scheme rule them all?  I guess the answer is yes, it is possible.

Looking at the BloggerAPI, I see a different postId is used than the URI.  It seems to me that if the backend to the CMS is a database, and the method of publishing is static, it may not always be convenient to do the mapping the other direction.

I know that in Radio, an entry id is a small integer.  Does anybody out there know what the equivalent is for MovableType?  Blogger?  LiveJournal?  others?

Posted by Sam Ruby at

Agree with Sam, though some clarification in the form of an example of post-id would be helpful. (I think I know what you mean, but I want to be sure.) The guessing that guid imposed made matters less than simple.

Also agree with Mark. No camel case please.

Posted by Timothy Appnel at

MovableType uses small integers for id'ing posts, comments and trackbacks.

Posted by James Snell at

Is the internal name supposed to just be a unique identifier for the post?  Like a guid in RSS2?  If so, let's call it a uniqueId or something along those lines - internal name sounds like something else.

Posted by Greg Reinacker at

MovableType identifies entries internally by a sequential number, which by default it uses to generate the external URL. You can override that though. Look at Pilgrim's URLs for an example.

Posted by Timothy Appnel at

Blosxom uses the filesystem as its database, so post ids are generated from filenames.  I believe they're guaranteed to be unique only within a given path/category. (Rael, are you out there?)

Posted by d.w. at

Radio uses a four-byte integer as a post id, as does Manila.

Posted by Dave Winer at

The new Blogger system uses 18-digit numbers for postID, just to be annoying.

Posted by Phil Ringnalda at

I learned some years ago that if I don't understand something and I keep asking for explanations, people only rarely get mad and when they do, that's OK.  So, I'm sorry, but I still don't get it.  If I'm writing a weblog or code to generate one, why do I need postID?  If I'm writing an aggregator, same question.  What's it for?  Once again, an example would help.

Hypothesis: is this supposed to be a version tracker?

Posted by Tim Bray at

Blogger internally uses an 8-byte post id.
We'll probably use something like 'urn:blogger.com:12345678901' for the global post id.
BTW, I've added my vote and a short statement to the permalinks discussion.

Posted by Misha Dynin at

You want a separate postid from a URI because, say, you shift hosts, then your old URI doesn't make any sense for you. This is especially true if as a user you have hosted with one provider then move to another, and it's YOUR data, not the hoster's, we are talking about. The software should maintain the map internally.

More generally, this is an issue with the RESTian approach to web services too. What happens if the URI of my noun (like my calendar, say) had to change? Redirection is definitely a possible option, but is protocol specific. In such a case too, having a GUID URN for at least the namespace part of the URI would be very useful, methinks...

Posted by Rahul Dave at

Tim, keep asking why.  Keep pushing back.

In my weblog, this entry is likely to become known as (using relative URLs for the moment):

  /2002/06/24/Linkage.html

However, the id I know of it internally will be 1492.  That's the id I will likely continue to use for pingback, trackbacks, comments, etc.

Looking at the blogger API, there is a concept of a postId (sorry for the camelCase guys, we'll fix that in a subsequent scrub).  This is distinct from the permaLink.  I could try to swim upstream and require tools to retrofit this, but looking at the discussion on PermaLinks, I see two different concepts getting conflated.

My bias is towards a specific type of simplicity.  Not the type of simplicity where you have one slot where you shove disparate types of things, but towards a clear definition of what every slot is for.  If we have people wanting unique URIs, and other people wanting unique URLs, require both, even if in a perfect and well-designed system they will be identical.  As I said above, a little redundancy is worth it, if it eliminates guesswork and precedence rules.

Does this help?  If not, keep pushing!

Posted by Sam Ruby at

The blogging software I've written uses the timestamp converted to seconds as a postId.  That yields something fairly unique, and easy to generate.

Posted by James Robertson at

I'm a little confused.

You start off talking about link and then jump to permalink and postid.  Are you proposing removing link altogether?

This would be problematic because some resources will want to syndicate link information, but can't for whatever reason keep those links from changing. 

A weather report for example can provide a uniquely identifying resource name  (e.g. weather://location/date), but will want to syndicate a link like:
http://weatherservice.com/current_conditions

So link is still very much needed.

Postid is useful primarily for the publishing profile to be defined later, and not for the syndication profile, correct?  Useful for software which can't intuit its own internal ids from a published URL. (Not so hard to imagine: depending on how you configured MT, it could be hard to turn permalinks into entry ids.)

Also is there going to be a later discussion of the interaction between permalink and real guids?  That is to say, guids which are useful for syndication and aggregation.  Kevin Burton put together some thoughts on what that might look like:
http://www.peerfear.org/rss/permalink/1031620231.shtml

Posted by kellan at

As Tim's questions suggest, I agree this needs to be cleared up. I reckon there are potentially four different kinds of identifier that can be associated with an item: 1. a remote page (link or about, remote content); 2. the item in the blog front page (#this content); 3. the item's permanent home (permaLink); and 4. a representation internal to the CMS system. I too am not entirely sure which you mean by postId.

I definitely agree that URIs should be used as the identifiers wherever possible, preferably using registered schemes (whether or not they are URLs).

re. camelCase - what programming styles do people use? I usually write in Java, and the case of initial letters helps distinguish methods from Classes. Similarly, in RDF/XML you can usually tell at a glance what is a Class and what is property (though RSS 1.0 is an annoying exception). Extra information for no cost. CamelCase is also a lot more human-readable than alllowercase, though Mark's point about people getting it wrong is valid. Would it be too much extra cost to allow anyThiNg and for the tools to pre-normalise the case? I guess I'd personally opt for CamelCaseWithInitialCapital, but don't think it's such a big deal whatever...

Posted by Danny at

Example of how globally unique postID can be used to improve user experience in the syndication context:

A Blogger blog with syndication.  The owner decides to change archive format from monthly to weekly.  This breaks most of the permalinks.  If the aggregator uses permalinks as post ids, the read/unread information is usually lost as well.

If the aggregator uses postIDs, blog restructuring is transparent to the reader -- read/unread flags are stored on per-postID basis, and the postID doesn't change when the permalink is updated.
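
A rough sketch of that aggregator behaviour, assuming Python.  The ReadTracker class, the entry dictionaries and the two permalinks are hypothetical; only the urn:blogger.com:... form of the id comes from this discussion.

```python
class ReadTracker:
    """Keep read/unread flags keyed by post id rather than by permalink."""

    def __init__(self):
        self._read = set()

    def mark_read(self, post_id):
        self._read.add(post_id)

    def is_read(self, post_id):
        return post_id in self._read


tracker = ReadTracker()

# First fetch: the blog uses monthly archives (hypothetical permalink).
before = {
    "post_id": "urn:blogger.com:12345678901",
    "permalink": "http://example.com/2003_06_01_archive.html#12345678901",
}
tracker.mark_read(before["post_id"])

# Later fetch: the owner switched to weekly archives, so the permalink moved.
after = {
    "post_id": "urn:blogger.com:12345678901",
    "permalink": "http://example.com/2003_06_22_archive.html#12345678901",
}
still_read = tracker.is_read(after["post_id"])
```

Had the tracker been keyed on the permalink instead, the second fetch would have shown the same post as unread.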

Posted by Misha Dynin at

I understand the scenario Misha laid out. Where does the concept of version come into play with the concepts of post-id and permalink? Is that a third element?

Here is the scenario: I made an entry a few days ago where I made an incorrect statement or asked a question. I update the entry for those who happen upon it via Google. Is there a way for others watching my syndication feeds with their aggregators to know I made an update, and if so, how?

Posted by Timothy Appnel at

Danny - I am not talking about #1.  Some blogging systems use #2 (e.g., Radio), and some use #3 (e.g., MT).  So I am talking about #2 and/or #3 for the permaLink.  #4 is the postId.  Both are supposed to be globally unique.  And both should be URIs.

Misha - just to be clear... changing archive formats is likely to be a rare thing, right?  Overwhelmingly, when someone receives a syndication feed, the expectation is that any permalinks you see there are likely to be valid for some period of time?  Ideally, forever, but nothing in this world is guaranteed, right?

Posted by Sam Ruby at

Rahul, the original (or primary) URI is always a "good" identifier, regardless of whether the resource moves or can be reached via other URIs.

That may be the use-case for "ID" vs. "resolvable URI" (curralink, not permalink).  The "original" permalink is always the ID, wherever it goes, and Entries published later can point to them.

In a recent application I gathered hundreds of links, of which about 10% are "dead".  The permanent URI is still the original link, so I added a 'mirrored' field to indicate the curralink.

This may be a naming/definition issue between "identifier" and "where can I find this thing now".  "permalink" isn't necessarily the same as "where can I find this thing now."

Posted by Ken MacLeod at

The question is "What is the postid for?", and considering we have permalinks, there doesn't seem to be a need for postid. I initially agreed with this assessment, but after mulling it over for a while I've built up a picture.

With a blog system using categories, and camel cased title, a permalink expressed as a URL would be http://www.example.com/categoryName/camelCaseTitle . That's good enough to uniquely identify an entry at a specific point in time.

Later on, because of volume or a change of interests by the author, a new category is created, and the above post is better suited for that category, the permalink then changes (yeah, I know: "Cool URLs stay the same") to: http://www.example.com/newCategory/camelCaseTitle .

To a blog or an aggregator the above two posts would be considered different, because the permalink is different. But underneath, it's the same post (well, identical). In that respect I suppose a postId is a good way of spotting these "duplicate" posts for one specific feed.

In my case I prefer a category based view of a blog, but too often a post can meaningfully belong to two categories. Two different permalinks sharing the same postid. Now I have a chance of not sending the same post twice (as a blog writer), or reading the same post twice (as an aggregator user).

Posted by Isofarro at

I think I kind of get it.  The perma-link will be the URI you'd normally expect to use to point at and reference the entry.  The post-id is an opaque string that is used (in some unspecified way) to identify the entry, within the context of its production system.  Right?

The problem here is that too many of the contributors are too close to the blogging-software problem space and are wiring their assumptions into their answers.  For example, at 'ongoing' I can see no reason whatsoever why I'd need or use a post-id.  I can see why I'd want some sort of opaque version stamp for each entry.

OK, here's the challenge: someone write a simple explanation in human-readable English, aimed at someone who's going to write (a) an aggregator and/or (b) a blogging engine, of what a post-id is and why you need it and how you use it.

Also, I keep asking for an example and not getting one.  This makes me suspicious.  Arguing about specific instances is much more fruitful than about abstractions.

PS: I'm with Mark: I like hyphens and the world has enough camel-case.

Posted by Tim Bray at

What if somebody has a post with a URL involving its category?

Imagine, for example, a photoblog. I might have a picture of a Castle at sunset in Palma de Mallorca, Spain, and wish to file that picture under Castles, Sunsets, Palma de Mallorca, Spain, and Europe.

I may also have several pie feeds for each category, as well as a central one, say "latest pictures".

Each category feed could have a different url (eg /photos/castles/bellver08.jpg as opposed to /photos/castles/sunsets/bellver02.jpg) however, they all point to the same real picture, so when I want to aggregate my feeds together to form a latest pictures feed, then I could just produce an XSLT rule which checks based on URN rather than links. This means that post would then only appear once in the latest photos feed.

Similar things could happen if you consume separate feeds from the same site.
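
To make the merge concrete, here is a small sketch in Python rather than the XSLT mentioned above.  The merge_feeds function, the field names and the photo URN and paths are all made up for illustration.

```python
def merge_feeds(*feeds):
    """Merge several feeds, keeping the first entry seen for each post id."""
    seen = set()
    merged = []
    for feed in feeds:
        for entry in feed:
            if entry["post_id"] in seen:
                continue  # same picture reached through another category
            seen.add(entry["post_id"])
            merged.append(entry)
    return merged


# Two category feeds carrying the same picture under different links.
castles = [{"post_id": "urn:example.com:photo:8",
            "link": "/photos/castles/bellver.jpg"}]
sunsets = [{"post_id": "urn:example.com:photo:8",
            "link": "/photos/sunsets/bellver.jpg"}]

latest = merge_feeds(castles, sunsets)
```

Comparing on the link instead would put the picture into the "latest pictures" feed twice.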

On a slightly separate note, if we are also using URNs to allow us to resolve a permalink that has changed for a post (as was suggested above in terms of moving site or changing layout), we probably want to define a mechanism for this. We may also need to give the blog a unique ID that will help resolving at a later stage, by, say, optional registration at a central repository such as blogger.com

I don't think unique IDs for posts should be globally unique, nor should it be possible to identify the blog of origin from the unique ID of the post. This limits the blog programmer's options.

Posted by Moof at

Woops, I missed Sam's example.  He says:

"In my weblog, this entry is likely to become known as (using relative URLs for the moment):  /2002/06/24/Linkage.html

However, the id I know of it internally will be 1492.  That's the id I will likely continue to use for pingback, trackbacks, comments, etc."

Sorry, don't get it.  Sigh, at least you can feel confident that once you've explained it to me eight times, nobody however slow can miss it.  Why do you need two labels?  /2002/06/24/Linkage.html looks like an excellent URI to me, why wouldn't you just use that for everything?  Why should anyone in the outside world ever have to deal with '1492'?

Posted by Tim Bray at

Yes, breaking permalinks is a bad idea, and I do not expect this to happen often.  However, the cost of providing a unique identifier (essentially an opaque string) is zero for the tool developer, and post-ids are useful in other situations, most notably in the API.

A last-modified timestamp can be used for versioning.  If  both post-id and modified time are required attributes of a post, then given any two posts, it's possible to answer correctly if

A combination of (post-id, last-modified) uniquely defines a version of the post.
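
One possible reading of that pair, sketched in Python.  The field names, the sample id and the dates are assumptions, and the comparison rules are a guess at the intent rather than anything agreed in this thread.

```python
from datetime import datetime, timezone


def version_key(entry):
    """The (post-id, last-modified) pair that names one version of a post."""
    return (entry["post_id"], entry["last_modified"])


def compare(a, b):
    """Say how two entries relate, using only post id and modified time."""
    if a["post_id"] != b["post_id"]:
        return "different posts"
    if a["last_modified"] == b["last_modified"]:
        return "same version"
    return "a is newer" if a["last_modified"] > b["last_modified"] else "b is newer"


original = {
    "post_id": "urn:example.com:45",
    "last_modified": datetime(2003, 6, 24, 10, 0, tzinfo=timezone.utc),
}
# Same post, edited the next day.
revised = dict(original,
               last_modified=datetime(2003, 6, 25, 9, 30, tzinfo=timezone.utc))

relation = compare(original, revised)
```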

Posted by Misha Dynin at

I can't tell yet whether Dublin Core is lumped with RDF as a taboo topic yet, so trundling down that path I go...

Dublin Core has a design pattern where you specify "the most common usage" of a term, and then offer "refinements" (we'd call extensions) that can be more specific about what the term is.  I described this briefly in the context of dates in the wiki.

Here, we could define 'identifier' as "A URI for this Entry", with a comment "Typically the URI resolves to the Entry."

We can then provide refinements (extensions) that are more specific: current-location (where the link can be found now), internal-identifier (an internal id, represented as a URI), and maybe also-at (additional resolvable URIs).

The design pattern suggests that a client can "alias" or "dumb down" more specific terms to their more generic terms.

Note, Dublin Core does have an 'identifier' term for this purpose.
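
As a sketch of the "dumb down" step, assuming Python: the REFINES table below uses the refinement names floated above, and dumb_down is a hypothetical helper, not anything defined by Dublin Core itself.

```python
# Each refinement points at the generic term it refines.
REFINES = {
    "current-location": "identifier",
    "internal-identifier": "identifier",
    "also-at": "identifier",
}


def dumb_down(entry):
    """Collapse refined terms into their generic term, keeping every value."""
    generic = {}
    for term, value in entry.items():
        base = REFINES.get(term, term)  # unknown terms pass through unchanged
        generic.setdefault(base, []).append(value)
    return generic


entry = {
    "internal-identifier": "urn:example.com:45",
    "current-location": "http://www.example.com/2003/06/foo.html",
    "title": "Linkage",
}
simple = dumb_down(entry)
```

A client that only understands 'identifier' still gets every URI for the Entry; it just loses the distinction between them.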

Posted by Ken MacLeod at

Ah, I should also have noted that the pattern calls for refinements that mean exactly what may be implied by generic usage, therefore permanent-identifier (a  permanent, unique identifier for an Entry).

Also, a link to Dublin Core terms would be good too :-)

Posted by Ken MacLeod at

Being one of those loons who wrote their own system, I'm all about the dates and times. Combining the month/date/year and hour/minute/second is how I get my post ids (a la blosxom, file system storage) and it's also how I generate my anchors/permalinks/guids/what-have-you. Mine is a single user system, and the likelihood of me posting more than one item per second is very low, but I suppose I could go finer-grain than seconds if I had to...

Oh, I also vote for LowerCase!

Posted by pete at

I can see a potential need for not requiring a perma-link. Some arguments have been made for the presence of feeds that don't correspond to a resource on the web, and thus can't supply a perma-link, ever. Is this just an edge case?

There was also some discussion that weblog publishing tools may need an internal identifier of some sort, to keep track of where an entry goes. From my experience with RESTLog this hasn't been the case, the URL that you GET, POST or PUT to has an identifier of the entry in it implicitly, and the server-side gets complete control on how to form those URLs.

Also, from experience with RESTLog, the required part of the perma-link needs to be context sensitive. For example, when I POST a new entry in RESTLog I don't know the perma-link for that item a-priori, it's the server that decides that and fills in the 'link' item after the item is created. So the perma-link might need to be required in the context of a feed, but not in the context of a publishing interface. I know I am getting a little ahead of the curve with that, just trying to throw out some perspective.

Now, Aggie can tell that an RSS item has changed because it keeps an MD5 hash of title+link+description. If the MD5 hash of an item is different from all the MD5 hashes it saw the last time it pulled a feed, it knows that item is either new or changed. Now with a guaranteed unique identifier, Aggie can  distinguish between 'new' and 'changed', and maybe do things like supply highlighted diffs to items that have been 'changed'.

That's why I liked the single unique identifier that MUST be a URI, and SHOULD be a URL.
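
Roughly what that could look like, as a Python sketch.  This is not Aggie's actual code; item_hash, classify and the field names are assumptions made for the example.

```python
import hashlib


def item_hash(item):
    """MD5 over title+link+description, the combination described above."""
    text = item["title"] + item["link"] + item["description"]
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def classify(item, previous):
    """previous maps unique id -> hash recorded on the last fetch."""
    old = previous.get(item["id"])
    if old is None:
        return "new"
    return "unchanged" if old == item_hash(item) else "changed"


first = {"id": "http://www.example.com/2003/06/foo.html",
         "title": "Foo",
         "link": "http://www.example.com/2003/06/foo.html",
         "description": "First draft."}
previous = {first["id"]: item_hash(first)}

# Same id, edited description: the hash differs but the id is already known.
edited = dict(first, description="First draft, with a correction.")
status = classify(edited, previous)
```

With the hash alone, the edited item is indistinguishable from a brand new one; the id is what lets the aggregator say "changed".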

Posted by Joe at

Tim Bray wrote:

/2002/06/24/Linkage.html looks like an excellent URI to me, why
wouldn't you just use that for everything?  Why should anyone in
the outside world ever have to deal with '1492'?

Good point. The /2002/06/24/Linkage.html is likely to be database unfriendly (especially if the post date is stored or queried as something other than '2002/06/24' - it may take a bit of SQL gymnastics to get the right results from a 'simple' query).

The reader and writer should never need to see the 1492 reference, but it makes database querying a little easier. (Call us programmers lazy :-)

Posted by Isofarro at

Joe wrote:

Aggie can tell that an RSS item has changed because it keeps an
MD5 hash of title+link+description.

I was toying with this idea earlier today. It is a neat idea on the face of it. The one tiny drawback (for me anyway) is that I sometimes need to go back and correct some spelling errors and grammatical errors (Usemod wiki calls this a minor edit, so it doesn't necessarily show up on Recent changes), sometimes a few times over a few hours. I'd rather that not be identified as a "new or updated post". On the other hand, when I add in some extra info, or new links, then it's worth identifying it as a new post.

Probably the postId idea won't solve this sort of problem, but an updateDate could (changing the date only after a "non-minor" edit).

On the topic of naming conventions, I prefer camelCase and no punctuation, but I can live without the camel and allow hyphens so long as underscores are banished.

Posted by Isofarro at

I'm moving toward Tim Bray's point of view. It seems like post-id is an internal implementation detail that real world weblog software might need to maintain in a config file or DBMS and associate with the perma-link, but I still don't see why this needs to be exposed to the feed consumers.

Maybe it's a perma-perma-link that is intended to survive the archive being moved to a different Internet domain?  Sounds like a corner case to me that violates the "do the simplest thing that could possibly work" principle. Worrying about corner cases up front, and trying to make life easy for implementers rather than end users, is the road to complexity that a number of W3C specs have travelled to the ultimate regret of their authors. 

I'd suggest that post-id be put on a list of possible enhancements that will be considered if implementation experience suggests it's necessary.

Posted by Mike Champion at

Here's one reason that a permalink should not be used as the unique identifier: more than one permalink can map to the same post. For example, in Movable Type or TypePad, you can turn on multiple archiving methods (Individual archives, Monthly archives, Category archives, etc) at the same time. So a post can have multiple permalinks, eg

  http://www.example.com/2003/06/foo.html
  http://www.example.com/2003/06/#000045
  http://www.example.com/tech/#000045

Now, the system could theoretically map back from URIs to posts internally, but that is a more complicated matter, particularly since in the first URI above, there's nothing identifying the post ID whatsoever (in the above example, the post ID would be 45, assuming this is MT).

So, the point is: to me it seems less confusing if we have a globally unique identifier separate from the permalink (though they can be the same) to identify a post, eg 'urn:example.com:45'.
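
A small sketch of that idea in Python, using the three example permalinks above.  The global_post_id helper and the lookup table are hypothetical; the 'urn:host:id' shape simply mirrors the example id in the previous paragraph.

```python
def global_post_id(host, internal_id):
    """Build an id of the 'urn:example.com:45' shape from host and entry id."""
    return "urn:%s:%d" % (host, internal_id)


# Every archive view of entry 45 carries the same post id.
PERMALINKS = {
    "http://www.example.com/2003/06/foo.html": 45,   # individual archive
    "http://www.example.com/2003/06/#000045": 45,    # monthly archive
    "http://www.example.com/tech/#000045": 45,       # category archive
}

post_ids = {global_post_id("example.com", n) for n in PERMALINKS.values()}
```

Three permalinks, one identifier: a consumer comparing post ids sees a single post.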

Posted by Ben at

If your blog software uses some internal id to track stuff internally I don't see any reason for it to show up in your syndication feed or for it to be used when posting comments or blog entries.

We are on the Web, URIs are the identifiers of the Web. If your database needs some int for faster querying then that is an internal implementation detail not something that should be foisted on the general populace.

Posted by Dare Obasanjo at

Joe writes:

... when I POST a new entry in RESTLog I don't know the perma-link for that item a-priori, it's the server that decides that and fills in the 'link' item after the item is created. So the perma-link might need to be required in the context of a feed, but not in the context of a publishing interface. I know I am getting a little ahead of the curve with that, just trying to throw out some perspective.

I think Joe nails a very important point that has tripped me up often during this past week. What context are we speaking about?

Looking at the three identified areas of syndication, archive and API, other than the vague notion of content everything could be argued as being optional.

Posted by Timothy Appnel at

Ken,

Point taken..sorta like namespaces where resolvability is not needed....

But say I own bla.com and have bla.com/blog/1.html#1 and tomorrow I let it lapse and you own it, and use the same blog scheme. Now 1.html#1 may be completely a different resource. How do I deal with this issue in the context of not using a postid?

Put another way, should a permalink carry the burden of resolvability also? Or only namespace? And let resolvability be a part of some external mechanism plus postid. Or even external mechanism plus permalink, which may be perfectly acceptable, though I think a postid makes it easier on aggregation/blogging/etc systems...

Posted by Rahul Dave at

While I fully agree that using only a URI may be the more elegant approach, I think we need to be pragmatic. As I understand it, this format is intended to be used as both a syndication/archiving format and as the data model for an API. In the latter case, it will be supplementing/replacing existing APIs such as the Blogger API, metaWeblog API, etc. Realistically, for the new format/API to be adopted by as many tools as possible, it needs to be flexible enough that support can be added without massive internal changes, and that may mean including the notion of a post ID.

Posted by Ben at

I like Ben.

The cost to the people who have implemented this elegantly: a tiny bit of redundancy.

I look at it this way: if you have such a legacy system, not having a postId is a show stopper.  I don't like show stoppers.

Posted by Sam Ruby at

OK, I see Ben's point.  BTW, at Ongoing, the categories and month/day snapshots have their own URIs and there's an HTML there that points to the entries by their perma-link; so you don't need a different URI to talk about a post.  E.g. http://www.tbray.org/ongoing/What/Technology/Publishing/ or http://www.tbray.org/ongoing/When/200x/2003/06/16/.  I'd probably be prepared to argue in favor of my way of doing things if we were designing this stuff from scratch.

But if the implementors say they need a post-id, then that's a good enough reason to have a post-id.  I still think the perma-link has special status as the "normal canonical way I'd like you to point at this from the outside world". 

So what's needed is a simple write-up on what syntactic rules govern post-ids (e.g. it's not obvious that they need to be URIs), and how they should be handled by aggregators, and how they could be used by authoring systems.  With examples.

Posted by Tim Bray at

I like the way this thread is progressing - feels forward ;-)

One slightly tangential thing that needs considering at the same time is : what is the title, description etc referring to? Before the rise of the blog-as-diary, it was clear - the thing being linked to ("Sam's New Project", http://samsuri) but nowadays the style has changed so that these refer to the current entry ("What I have to say about Sam's New Project", http://myblog). It's not a major problem in any sense, but I think it should be made explicit while we're talking about identifying stuff.

Posted by Danny at

With RPC interfaces, some kind of post ID is necessary. If a new RPC API standard is being created, can that take URIs as post identifiers?

I am making the assumption that blog software can / should translate between internal IDs and permalink URIs.

If this effort wants to account for existing RPC APIs like Blogger's, might that be better done through a translation service, i.e.: post -> NewAPI -> BloggerAPI?

I posted more comments on my blog here

Posted by Jay Fienberg at

My two cents. permaLink is the one and only resource that defines how to retrieve a given entry from now until the end of time. It is the master unique id. How I store the data behind the scenes shouldn't matter to consumers of my XML feed. If we want to add a separate postID (in the case that my backend is a relational database and I want to store the ID) that should be in an extensionModule (if we want to have a standard place for it) or in user defined metadata. The reason for putting it in a separate module, in my mind, is that the module itself imbues the field with semantic meaning, eliminating any doubt as to what the ID represents for the XML feed. I do not see a need for it in the same core namespace as permaLink.

Posted by Christian Romney at

Ben and Sam, you both make a compelling argument for supplying both a perma-link and a post-id. I'm sold.

Posted by Joe at

Could someone post a section of "good" docs that "is a simple write-up on what syntactic rules govern [something], and how they should be handled by aggregators, and how they could be used by authoring systems.  With examples.", as Tim describes?

Preferably on the wiki.  Not boilerplate (/me hates boilerplate), just a good sample to follow.

I also see a tool/service/format matrix in our near future... :)

Posted by Ken MacLeod at

Tim Bray wrote:

So what's needed is a simple write-up on what syntactic rules govern post-ids (e.g. it's not obvious that they need to be URIs), and how they should be handled by aggregators, and how they could be used by authoring systems.  With examples.

I'll certainly take that todo.  However, today (and for the next few days), I'm trying to focus on finding the areas of broad agreement and where there might be major issues to work.  Meanwhile, I'm still trolling for specific examples of how real products identify posts today.

Posted by Sam Ruby at

Forgot to add that I've voted accordingly:

http://www.intertwingly.net/wiki/pie/PermaLinks#head-dafe1d3d8883a9cabf1e064ba6f915fbf1ebd014

Posted by Joe at

A different case against postid's:

If we really standardize on a well-formed weblog entry then I can use the well-formedness to upgrade or switch to a different content management system and simply import my existing weblog.

A problem with allowing an opaque string to help CMS systems is that they are only valid within a particular CMS system instance. My upgraded/new CMS would be perfectly happy to serve the imported permalinks, but a "45" postid would become a problem. Is this an 8 byte integer postid from the old CMS that was cached somewhere or is this a valid small integer postid generated by the new system?

To avoid these kind of issues (in APIs, etc) you would then always want both the permalink and the postid to be present such that you could validate the postid. And if you would have to do that, you might as well drop the postid altogether. Or you want to force structure in the postid (MT:000045 or radio:45).

I understand the need for internal identifiers to make life simpler for the CMS developers. But if the permalink is unique then the CMS should have no problem going from permalink to postid. A quick hash of the uri can be used to build an index to go from uri to postid. Multiple permalinks can map to the same internal postid. CMS developers can make the postid part of the permalink if they are really concerned about the overhead of access.
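
A sketch of such an index, assuming Python and a SHA-1 digest as the "quick hash" (any stable hash would do; the PermalinkIndex class and its method names are invented for this example).

```python
import hashlib


class PermalinkIndex:
    """Look up an internal post id from a permalink via a hash of the URI."""

    def __init__(self):
        self._by_digest = {}

    @staticmethod
    def _digest(uri):
        # A fixed-length key, convenient as a database index column.
        return hashlib.sha1(uri.encode("utf-8")).hexdigest()

    def add(self, uri, post_id):
        self._by_digest[self._digest(uri)] = post_id

    def lookup(self, uri):
        """Return the post id for a known permalink, or None."""
        return self._by_digest.get(self._digest(uri))


index = PermalinkIndex()
# Several permalinks may point at the same internal id.
index.add("http://www.example.com/2003/06/foo.html", 45)
index.add("http://www.example.com/tech/#000045", 45)

found = index.lookup("http://www.example.com/2003/06/foo.html")
```

The internal id stays inside the CMS; only permalinks cross the wire.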

Using the postid to indicate a changed or updated entry is wrong. Use the last-modified-date (or use Joe's hash trick).

Posted by Werner Vogels at

Should there even be syntactic constraints on postid's? Radio uses numbers, someone else may want a 128 bit clsid, etc, etc

I mean, is it not enough to specify that the postid is unique within a namespace? After defining namespace suitably, of course :-).

How it's translated to a permalink is up to individual software, and come to think of it, an aggregator and a blog tool might even want to interpret it separately... e.g., Radio's number has an 'integer' meaning for Radio, but to another aggregator it may just be a unique string namespaced under the base URI used in the permalink URIs.

Posted by Rahul Dave at

Hopefully, this post ID is not expected to appear in syndicated feeds nor is needed for posting comments/entries. I already find it fairly annoying working with the current morass of <guid> vs. <link> and would hate to continue it in this brave new world.

What would be extremely special is if postID shows up and has slightly different semantics from <guid> which we have now. :(

Posted by Dare Obasanjo at

Wiki

I find Sam's Wiki and related discussion interesting and potentially important. I find it a bit academic at the moment because I haven't seen or heard from the "Bigs" in our small pond. What do the Google/Blogger folks think? What does Six Apart...

Excerpt from Archipelago at

From a software engineering standpoint, postIDs are a code smell.  Private housekeeping data has no business being exposed like that.  From a practical standpoint - why do permaLinks have to be fully-qualified?  What's wrong with relative links, either relative to the path the feed was found on, or relative to a 'base' element?  That would take care of the "moving to a new host" problem.  It wouldn't solve the "moving to a new CMS" problem, but at some point you have to draw the line.  Just how much server-side screwiness do we force clients to deal with?  Seems to me the burden of maintaining the consistency of URLs is on the content provider, not the client.  If that means redirecting old URLs to new ones, so be it - and that has the advantage that it's the established way of saying "what was at that address has been moved to this address".
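To make the relative-link idea concrete, here is a minimal sketch using Python's standard `urljoin`. The base URLs and the `resolve_permalink` helper are hypothetical; the point is only that the same relative permalink resolves correctly after a move to a new host, as long as the base changes with it.

```python
from urllib.parse import urljoin


def resolve_permalink(base: str, permalink: str) -> str:
    """Resolve a possibly-relative permalink against the feed's base URL."""
    return urljoin(base, permalink)


# The feed carries only the relative link; the base comes from where the
# feed was found (or from a hypothetical 'base' element).
old = resolve_permalink("http://old.example.com/blog/", "archives/linkage.html")
new = resolve_permalink("http://new.example.org/weblog/", "archives/linkage.html")
```

A fully-qualified permalink passes through `urljoin` unchanged, so a client written this way handles both styles.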

I can see an aggregator using a postID as a unique identifier to tell whether an item has been read or not, but this seems like more of a justification for a new formerlyFoundAt element than a justification for a postID.

Posted by Avdi at

A globally unique identifier for an entry seems much more useful than a local id.  Unless a global id can be inferred from a blogId+entryId combination, but that presumes the identifiers would use the same URI scheme.

It's a given that existing tools would have to syndicate differently, given a different format.  But where they use local ids, it doesn't seem to me that there's much of a pain in just providing global ids.

Posted by Chris Wilper at

This is a good thread. Tim Bray, I hope you question everything on the model, because you sure helped the discussion zero in on some good points.

I think Ben's point on the CMS needs for post-id is very well put. However, when I tried to think of interoperability examples that only a post-id would provide the support for, I couldn't find anything a permalink couldn't provide. Sure, weblogging tools such as MT use an entry_id to generate the default URI (which can be changed), and this is efficient from a database point of view -- but this doesn't have to be part of a global weblogging data model. It could be something useful for MT, but not used externally. Couldn't it? I mean, not all tools have to support the same thing internally, do they? Just compatibility between external interfaces and formats.

Basically, I'm agreeing with Werner.

Joe, I saw the page with the vote, but I'm not sure what the vote is for? I'm confused by the 'peas and carrots' and 'Moe' and other like references, and perhaps because I've not been following along on this discussion, have a hard time understanding exactly what that vote is for. Any way one of you could translate what's being voted on, and perhaps translate why 'peas and carrots' and the rest? Into something for newer participants? I have a feeling there's some inside references going on here.

Unless I'm more tired than I thought, which I could be.

Good thread.

Posted by Shelley at

Shelley,
The peas and carrots thing isn't an inside reference.  It's just giving unique names to these identifiers we're talking about, for the purpose of not confounding the intended semantics of a vote option.  The bottom (discussion) part of that page on the wiki is where the names are assigned to the concepts, on a per-vote-option basis.

Posted by Chris Wilper at

Just tacking on a few late comments:

- JournURL uses long integers as internal post IDs.

- Personally, a post ID alone would not be enough to reconstruct a permalink after a move... I would also need a user ID and a community ID. Assuming a WFLE already has a user ID attached, this means that, for the purposes of archiving, I would probably need to concatenate the postID with the community ID. Something like "54276-6".

- I agree that the post ID is unnecessary for the purposes of syndication... JournURL doesn't produce permalinks that it can't use to backtrack toward a given entry. But IDs would be very useful for archiving, and perhaps even mandatory for an API.
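The concatenated archive identifier mentioned above could look something like this. It is only a sketch: the function names are made up, and it assumes the "54276-6" example means post 54276 in community 6.

```python
from typing import Tuple


def archive_id(post_id: int, community_id: int) -> str:
    """Join a post id and a community id into one archive identifier."""
    return f"{post_id}-{community_id}"


def parse_archive_id(value: str) -> Tuple[int, int]:
    """Split an archive identifier back into (post_id, community_id)."""
    post, community = value.split("-", 1)
    return int(post), int(community)
```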

Posted by Roger Benningfield at

Hmm, one thing in Werner's comment doesn't feel right to me: "Multiple permalinks can map to the same internal postid." This seems back-to-front; surely the permalink should be a URI, and should be the identifier for the definitive version of a post. If there is to be a postID, then that should be the flexible entity. Locally variable, globally unique.

Avdi has a good point about exposing something primarily internal such as the postID smelling bad. But if it is key for legacy systems as Sam suggests then it does need to be considered -  after the essential modelling. Hacks for legacy shouldn't really be built into the core data model, should they?

Posted by Danny at

Danny, my 'multiple permalinks' example came from Ben's earlier comment. In his example a single posting can generate several entries; depending on category, archives, location, etc., each of these can be identified by a separate permalink, if the CMS is configured to do so. Each of these permalinks leads to the same posting but in a different context.

For example, each of my postings results in a regular version and a mobile stripped-down version. Each has a different permalink, but they are generated from the same posting and have the same internal postid, which is used for trackback, comments and updates.

Posted by Werner Vogels at

Danny: "This seems back-to-front..."

That's exactly how it works with JournURL... the post identified internally as ID #320 can be accessed by the world from:

http://journurl.com/news/users/admin/index.cfm?mode=article&entry=320
http://journurl.com/news/users/admin/index.cfm/mode/article/entry/320/
http://journurl.com/news/users/admin/index.cfm?month=06-20-2003&time=10:37:55

...and a bunch of others. When you mix in wildcard subdomain support, you end up with:

http://admin.support.journurl.com/news/users/admin/index.cfm?mode=article&entry=320

JournURL users can also share their entries, so entry 320 could also end up with a permalink in a completely different blog, even though it's the same content. And of course, entry 320 is also its own entity within the forum environment:

http://journurl.com/news/index.cfm?fa=skin.read&group=11&thread=104&message=320&date=all

There is no One True Permalink when your content is exposed through multiple outlets simultaneously.

Posted by Roger Benningfield at

Sorry, Sam... the linkify function doesn't seem to like URLs with unescaped ampersands.

Posted by Roger Benningfield at

Roger - fixed.  My linkify function is intentionally a bit on the conservative side.

Posted by Sam Ruby at

Sam,

Could you clarify what relation postid needs to have with guid, if any? Are you intending to do away with guid and have postid serve its purpose?

Posted by Rahul Dave at

Echo chamber

The format-that-must-not-be-named seems to be gaining support from all corners of the web.... [more]

Trackback from dive into mark

at

I think I'm finally getting it.  But let me spell it out in words of one syllable.  Suppose I'm a programmer sitting down to write an RSS aggregator.  I think the authoring systems want me to treat any 2 or more entries from the blog at http://example.com/blog/ which have the same post-id as actually being the same entry, regardless of their URI.

If this is true, then... STOP.  Is this true?

Posted by Tim Bray at

No CamelCase, please; yes to hyphens or underscores.

Posted by Maciej Ceglowski at

Re: is changing archive schemes rare?: depends on how you look at it. In the life of an individual blog, it's fairly rare: most Blogger blogs start out with weekly archives, find the list gets too long, and change to monthly, once. Many MT blogs start out with date-based archives, find that individual entry archives allow them to get rid of comment popups, and switch, once. But if you are looking at a large enough number of blogs at once, a change in archive scheme is pretty common.

A concrete example for when post-id would be useful in syndication: although most of my readers are only interested in posts about blogging tech, I also post about other stuff at times, and about MT stuff in several categories, and general blogging stuff in another. If I were generous enough to offer per-category feeds, I would probably use a permalink within the category archive, that being the context that someone subscribing to a single category would most likely want. But a person who subscribes to both my "Blogging Tech" feed and my "MT Hacks" feed would see an item that spans both categories twice, unless their aggregator noted that both items have the same post-id, and marked both as read when they read either one. I do have a single canonical permalink location, but in this case I might choose not to use it. Or, in my main feed, the permalink for an entry might be the link, fragment-free, but in the context of a per-entry comment feed, the link for the original entry might have a fragment (#c1, or #entry, depending), but the post-id would identify it as an entry that's already been seen in another context.

Or, for a more realistic but less concrete example, suppose you're developing a commercial program, and you have a public news feed that includes items about public releases of major versions, and a password-protected feed for developers and beta testers that doesn't include the news, but does include items for every major and minor version released. You don't want to use the same permalinks for both, since the private web page for items includes comments, but the public page doesn't, but you don't want your private feed subscribers to have to see the same item from both feeds.
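A sketch of what the aggregator side of both examples might look like. The `post_id` and `permalink` field names are placeholders, not part of any agreed format, and the sketch assumes post-ids are globally unique so they can be compared across feeds.

```python
class ReadTracker:
    """Tracks read items by post-id, so cross-posted items are seen once."""

    def __init__(self):
        self._read = set()

    @staticmethod
    def _key(item: dict) -> str:
        # Prefer the post-id; fall back to the permalink when it is absent.
        return item.get("post_id") or item["permalink"]

    def mark_read(self, item: dict) -> None:
        self._read.add(self._key(item))

    def is_read(self, item: dict) -> bool:
        return self._key(item) in self._read


# The same entry as it might appear in two per-category feeds.
in_blogging_tech = {"post_id": "http://example.com/id/42",
                    "permalink": "http://example.com/blogging-tech/42.html"}
in_mt_hacks = {"post_id": "http://example.com/id/42",
               "permalink": "http://example.com/mt-hacks/42.html"}

tracker = ReadTracker()
tracker.mark_read(in_blogging_tech)
```

After reading the item in one feed, `tracker.is_read(in_mt_hacks)` is true even though the permalinks differ.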

Posted by Phil Ringnalda at

Tim:

I think the authoring systems want me to treat any 2 or more entries from the blog at http://example.com/blog/ which have the same post-id as actually being the same entry, regardless of their URI.

Speaking for myself, I'm not really wanting much of anything at this point. I'm content to just describe existing behavior.

Right now, any entry in my blog will automatically show up in at least five different RSS feeds... the blog's main feed, a comment feed, a category-level comment feed, a community-level comment feed, and a community-level blog feed. If I've allowed another user to share my content, then my entry will also show up in her blog feed. And if my entry were a photoblog entry with an attachment, it might also show up in a whole 'nother set of feeds.

Many of those feeds have a completely different idea of what my "entry" is. My main feed thinks it's a blog entry. The comment feeds think it's a post to the forum. The other user's feed thinks it's part of her blog, but with a different dc:creator. The photoblog/attachment feed might think it's just a purty picture. You get the point.

And all of these feeds have different ideas about how to provide a permalink. My blog feed will point back to my blog, naturally. The forum will point within itself. The other user's blog will point to her instance of the entry within her site. And so on.

Is it vital that aggregators recognize that all of these disparate items are really the same entry? Not necessarily. It might be nice, and enable some relatively cool stuff, but I don't think it's absolutely necessary. I won't lose any sleep if post IDs don't make the cut. But I can't say I see any particular reason for them to be left out, either.

Posted by Roger Benningfield at

Okay, so since there can be so many permalinks for one post, you actually really don't want the URI as a Globally Unique Identifier, since there are several URIs that point to the same post.

So one could argue that the permalink should not be a GUID, and that one needs the post-id to be a GUID, like <post-id>ongoing:SamsPie</post-id>, maybe. This is unique to the post, there will be only one of these for a post, so it allows for identification of that post. Part of the problem with the URI as a GUID is that there may not be another one of the posts at the URI, but there can be many URIs for the post, which makes matters difficult.

So then you just have the permalink there to allow for fetching the entry. I'd think that this would be easier not using #fragments since they don't have an end tag-like thingy, but that might just be me.

Posted by Manuzhai at

Werner, Roger and all, re. one true permalink - ok I'm convinced, multiple permalinks ok. I'm not entirely sure about using the URL as a means to structure in this way (serving up exactly the same data), seems inelegant somehow, but I guess if it works then I shouldn't knock it. What really convinces me here is the idea of the different views - e.g. HTML and the WML versions of the same content.

Which does rather change my position on post#id - I've a strong feeling that a 1-to-1 unique identifier is required somewhere, irrespective of implementation, let's say which identifies the concept of the entry. So if multiple permalinks are needed, then so is a post~id.

Posted by Danny at

echo or pie?

The Wiki for the "conceptual model of a log entry" that Sam Ruby started a few days ago (as I was mentioning here) has been gathering speed. After reading through most of the material I started contributing some of my thoughts today (after all, this...

Excerpt from d2r at

FWIW, here's how I did it in RSS 2.0.

http://scriptingnews.userland.com/2003/06/25#postIds

Posted by Dave Winer at

In a namespace - funky!

The backup and restore case is compelling  - a 1-to-1 identifier for the 'root' entry is desirable. (It doesn't necessarily mean breaking the permaLinks not having this though - if the permalinks are based on the date/time of the post then they could still be regenerated consistently).

Posted by Danny at

"So... let's call a permaLink a permaLink and a postId a postId.  Require them both to be present.  And make them both URIs."

Please do not make both required, I suggest just the permaLink is required, make the postID optional. I believe that to be successful this format should be easily hand rollable, meaning that each required element should have an intuitive meaning. I do not believe postId has an intuitive meaning, a question I would ask is "what intuitive meaning does a postId have to someone who handrolls their RSS feed?"

Talking about APIs and the like is all very well but not everyone who wants to provide a news feed of their items needs to implement all the bells and whistles implied in the adoption of an api.

Posted by Ben Meadowcroft at

Why make either mandatory? Isn't it possible that there may be uses for a purely 'live', transitory feed?

Posted by Danny at

Phil,

Thanks for two very nice examples for the need of postid, in addition to Dave's archiving example... So then, postid will supersede guid, and as mentioned, permalink doesn't need to be unique for an item, but each permalink points to 1 item only.

Posted by Rahul Dave at

PS. you could still have a transitory feed if the permalink is a non-URL URI.

Posted by Danny at

"Please do not make both required, I suggest just the permaLink is required, make the postID optional."

If you just need a post-id for entries that have multiple permalinks, this makes sense. Some weblogging systems will only have one permalink for each entry, which makes the post-id somewhat redundant.

Posted by Manuzhai at

Rahul - supersede?  Think of it this way... what we are looking at is a blank slate.  We are trying to decide what goes into a new conceptual model.  In a later step on the roadmap, we will want to figure out what we are going to reuse from various places.

Will postid supersede guid?  I guess that depends on how you have chosen to use guid to date.  Depending on how you have used it, it may very well be that it is permalink that supersedes guid.  Or perhaps guid will not be superseded at all - if we can identify a distinct concept from the other two that needs to be captured.

As it should be clear based on how I started this thread, my current thinking is that the two distinct ways in which a GUID can be used should be teased apart, and both of the resulting concepts should be expressed in a globally unique manner.

Posted by Sam Ruby at

Danny, using a namespace in that context is perfectly appropriate, it's why RSS 2.0 got namespaces, so you could add information to a feed that's application specific.

Posted by Dave Winer at

1. Can we use wiki pages instead of comments for these discussions? Wading through all these comments is a real chore; it'd be nicer if only one person had to do it and could summarize for the rest.

2. +1 to no camelCase.

3. If postId is there for internal numbers, it shouldn't be a URI and it shouldn't be required.

Posted by Aaron Swartz at

Christian Romney

I find it useful to think of things in terms of an easily understandable analogy. Take for instance the question of linkage as it relates to the concept of trying to get to a physical location such as a house. My house has one address. Yet I can give people many different types of directions to get to that one singular address. One can approach from the North, South, East, or West. No matter what road one takes however, there is only ever one address. The question we must answer is what permaLink represents. From what I've gathered in the comments it seems some of us feel that permalink should correspond with the notion of address, while others seem to imply that permalink equates to directions. This needs to be clarified. An address doesn't tell you how to get somewhere, only where that place is. This is like a GUID or postID. Directions clearly specify a means of getting somewhere. Permalinks by the very nature of the word seem to imply a means of getting somewhere.

Another interesting item that I feel is slightly muddled (at least for me) is the concept of what is actually to be found at the address. For me an address identifies the location of the house, not a view of the house. Thus, the address of my house is the same whether I look at the house from the front, back or sides. The same applies to linkage. To me, what an entry represents is an idea or set of ideas. I may wish to view those ideas visually as HTML or consume them as XML and feed them into a text-to-speech engine and listen to the idea(s). The point is there is one address (id) for the idea, and multiple, optional directions (permalinks) for accessing prefabricated views or expressions of that idea. Therefore, I have reversed my previous position. +1 required postID +1 multiple optional permalinks.

Message from Christian Romney at

I, for one, am still a little confused.  Maybe it's just me.

As I understand it we have 3 distinct fields potentially in the mix: link, perma-link, post-id

Now if my application has a concept of fixed URLs (like most blogging apps) then I'm going to stick that permanent URL into link. At which point the perma-link element is superfluous, correct?

Now if my application doesn't have a concept of fixed URLs, then I'm going to stick something which works for now into link, and I'll want some other way of giving the post a unique identifier.  I wouldn't want to use a field called "perma-link" for that; I'm a blogger, and I know that a perma-link should be clickable.

The RSS 2.0 spec deals with this by giving guid an isPermaLink attribute, but that seems a little silly.  Why not use one field for linking, and the other for identity?

Which brings us around to post-id. 

Now I think acknowledging in the data model that just about every system has a concept of post-id makes sense.  It's basic to the data model.  Post-ids potentially though make lousy identifiers, as each system has a different internal mapping of how to identify a post.

Also I remember working on several large web projects where one of our security guidelines was never to expose the record numbers of any of the data, to minimize the chance of someone being able to munge the URL to get someone else's info.  So potentially while post-id is key to the data model, it shouldn't go out with your vanilla syndication format.

So, I still think we need a link element, and some sort of identity element, and I don't really see the facility of an element named perma-link for either of those uses.

Posted by kellan at

Aaron:

1. feel free to use the wiki.  However, that doesn't mean that things won't also be discussed in comments, hall conversations, whatever.  In fact, I expect the topic will come up at dinner on the 7th.

2. looks like a lot of people don't like camelCase.  FWIW, I'm not a fan of Initial Caps either.

3. a post-id should be a universal resource identifier (no caps - I'm talking conceptually here).  Ideally, it should also be a locator.  And there should be a clear and unambiguous way of determining whether it is a locator and, if so, what scheme should be used.
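One way to picture the "is it a locator, and with what scheme" check is to look at the part of the URI before the ":". The sketch below uses Python's `urlsplit`; the `LOCATOR_SCHEMES` set and the `classify` helper are assumptions made up for this example, and which schemes count as locators is a policy choice rather than anything settled here.

```python
from urllib.parse import urlsplit

# Assumed set of schemes a client knows how to dereference.
LOCATOR_SCHEMES = {"http", "https", "ftp"}


def classify(identifier: str):
    """Return (scheme, is_locator) for a post-id or permalink."""
    scheme = urlsplit(identifier).scheme.lower()
    if not scheme:
        # No scheme at all: not a URI in the sense used here.
        return ("", False)
    return (scheme, scheme in LOCATOR_SCHEMES)
```

With this, an http:// style post-id is reported as a locator, a urn: style one is reported as a plain identifier, and a bare number is reported as neither.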

Posted by Sam Ruby at

Christian: now place that house in a hall of mirrors.  In fact, place it in such a way that it can be seen at several different "places", perhaps the "real" house is not visible at all.

In this case, the house that everybody points to is merely a reflection.  Perhaps one of many.  People can point to a reflection and say "that one over there" without saying how to get from here to there.  And can say "I took a picture of that one" in conversations with friends.

Now take a look at these words.  They actually appear in a number of different locations.

Kellan: even though you may have started at a different place, I think we are converging.  There should be exactly two concepts.  What's left to debate is whether or not a prefix of "perma-" makes things more clear.  Remember, this is a clean slate.

Posted by Sam Ruby at

Christian Romney

Follow up to my last post. I'd like to amend this line:
An address doesn't tell you how to get somewhere, only where that place is. This is like a GUID or postID.

Clearer is:

An address doesn't tell you how to get somewhere, it only expresses a unique location in terms of some agreed upon format. This is like a GUID or postID.

Message from Christian Romney at

Christian Romney

Sam: Even if the house is seen in reflection from many angles, wouldn't we (and shouldn't we) always refer back to the source in some way as we do with trackbacks? I think you have a much bigger picture here than I do. Could you please explain the implications of your analogy (fun house) as they pertain to linkage. I'm very interested in your thoughts on this.

Message from Christian Romney at

+1 post-id as universal resource identifier.

I've been confused as to whether I was on the same page as everyone. This suggestion makes sense. A post-id that is not a URI may be helpful to the system that generated the entry, but it is almost useless to the outside world. As a URI I know that I can identify that item without any additional elements, precedence rules and so on. This should work out just fine because the system that generated the globally unique identifier as a URI should generate it in such a way that it can resolve it back to a post-id or any other critical ID info if necessary.

Assuming I got this much right, what about a versioning identifier? How does that relate to the unique identifier? Is this a third element or part of the unique identifier?

Posted by Timothy Appnel at

Dare Obasanjo

Wow,
It's the URN vs. URI vs. URL debate happening somewhere that isn't an XML mailing list. Most XML-DEVers (and WWW-TAGsters) are probably having feelings of deja vu right about now. For some background see http://www.kuro5hin.org/story/2003/2/5/11349/85355

There is no reason why a postID/link/permalink should not be collapsed into a single concept and named with a URI. The only arguments I've seen against this reference internal database ids or moving sites around which can all be handled without creating two divergent identifiers especially in this world of relative URIs and xml:base

Message from Dare Obasanjo at

1.  Of course they will, I just think your series of discussions here should close with links to wiki pages, not encouragements to use the comments.

2. Well, you were using it particularly egregiously (permalink is one word, not perma link).

3. You've given me three "should"s but haven't answered my question. Why does it have to be a URI? Requiring that seems to get rid of its best use: a place for the database IDs from Movable Type, Blogger, and the like.

Posted by Aaron Swartz at

Dare,
Doesn't linkage imply navigation to a representation/view of data? If we rolled id and permalink together, how could you specify multiple representations given one unique, immutable URI?

Posted by Christian Romney at

Sorry, but I still don't get why we would need (much less require) a link, permalink and postid.

According to the current RSS 2.0 spec, the <guid> element does not have to be a permalink, so why not just use this to store an abstract postid? If you change ISPs or change the structure of your weblog, this will then only affect the <link>. If a <guid> is present, aggregators should use this to identify an item and should therefore not be confused by the changed link.

I recently changed the structure of my weblog by going from date-based archives to individual ones. This changed all the <link>s in my feed, but my <guid isPermaLink="false">s all remained the same, which allowed SharpReader (and hopefully other aggregators as well) to still uniquely identify all items in my feed.
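In aggregator terms, the behavior described above amounts to something like the following sketch. The feed snippets and the `item_identity` helper are invented for illustration; this is not SharpReader's actual code.

```python
import xml.etree.ElementTree as ET


def item_identity(item: ET.Element):
    """Identify an RSS item by its guid if present, else by its link."""
    guid = item.findtext("guid")
    if guid and guid.strip():
        return guid.strip()
    link = item.findtext("link")
    return link.strip() if link else None


# The same item before and after a change of archive structure.
before = ET.fromstring(
    '<item><link>http://example.com/2003/06/25.html#a1</link>'
    '<guid isPermaLink="false">example-post-1</guid></item>'
)
after = ET.fromstring(
    '<item><link>http://example.com/archives/000001.html</link>'
    '<guid isPermaLink="false">example-post-1</guid></item>'
)
```

Both versions yield the same identity, so the item is not shown as new when only the link changes.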

Yes, if you do this you can no longer have 2 urls associated with an item (<link> and <guid isPermaLink="true">), but I never quite got why people would want to do that anyway since they're both supposed to end up on the same page, right?

Posted by Luke Hutteman at

Christian: the "source" may be a MySQL database on a server some place.  With a rather opaque key that is used as a unique identifier.  In fact, you may never see the source, you may only see reflections.

If I syndicate an excerpt, it would be helpful if I provided a link to where you can find the whole story.  In such cases, you might find it helpful if I point to a location where you can get a representation of my content, instead of the original source.

Many of the advocates of the concepts of "REST" would argue that in a well designed system, the source should also be a URL.  One that you can POST information to - information such as comments.  It is my intent that the outcome of this RoadMap will not only accommodate, but actually will facilitate, such systems.

How a client can decide whether or not this form of interaction is supported by a given server is a problem for another day, at the moment we are in the Conceptual Model phase.

Posted by Sam Ruby at

"Requiring that seems to get rid of its best use: a place for the database IDs from Movable Type, Blogger, and the like."

That doesn't seem entirely true... I could give a URI like http://www.manuzhai.nl/weblog/postid#256, and though it might not be resolvable (it's not), it is easy to extract the internal ID from it.
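For instance, assuming the convention is simply "the internal ID is the fragment", the extraction is a couple of lines with Python's standard `urldefrag` (the helper names are made up for this sketch):

```python
from urllib.parse import urldefrag


def internal_id(postid_uri: str) -> int:
    """Extract the internal database ID stored in the URI fragment."""
    _, fragment = urldefrag(postid_uri)
    # Raises ValueError if the fragment is missing or not a number.
    return int(fragment)


def postid_uri(base: str, internal: int) -> str:
    """Build a post-id URI from a base and an internal database ID."""
    return f"{base}#{internal}"
```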

Posted by Manuzhai at

Luke: we don't need all three.  We only need two.  Now we need to decide what to call the two.

Apparently, link was not intended to be used as a permalink.

A new name with a precise definition would make things a lot more clear, IMHO.

Posted by Sam Ruby at

Dare Obasanjo

Christian,
Like I said, this is the URI vs. URL debate just taking another form. A URI is the superset of URLs and URNs which is primarily used as an identifier. It is encouraged that URIs should map to the addresses of network retrievable resources but they do not have to. 

Secondly, I think you are misusing the term "multiple representations". From my understanding, multiple representations refers to sending different bits on the wire depending on the capabilities of the User Agent (e.g. application/xhtml+xml vs. text/html), not mapping a single resource and representation to multiple URIs.

It seems the question you are asking is if my weblog had URIs beginning with http://www.example.com/blog/technology/, http://www.example.com/blog/work-related/ and http://www.example.com/blog/xml/ then I should be able to post the same entry in all three places and have one unique identifier for internal purposes and perhaps to let aggregators be smart about marking stuff as read if the user is subscribed to all three feeds. This feature seems reasonable but it doesn't look like a must-have which belongs in the core. It seems like this may just end up codifying what is current practice amongst folks that have both <guid/> and <link/> in their feeds today such as http://www.luckypines.com/blog/rss.aspx

Message from Dare Obasanjo at

Permalinks? No, URIs, URLs, and identifiers.

The problem is: there isn't a clear definition of what a "permalink" is. Some people are using the term "permalink" to mean any URL that points to or resolves to the Entry resource. Others are using it to mean a canonical or unambiguous... [more]

Trackback from Ken MacLeod

at

Christian, RE mirrors,
  Ben, with a clear example, nails the need for unique identifier here:

  http://www.intertwingly.net/blog/1492.html#c1056551780

Think about it also from an aggregator standpoint, if you subscribe to more than one of Mark Pilgrim's category RSS feeds:

http://diveintomark.org/xml/

You may end up seeing the same item multiple times if it is cross-posted to multiple categories. A unique id fixes that problem.

And in response to Aaron, any form of unique identifier can be embedded in a URI; the advantage is having a uniform syntax. I talk more about that here:

http://bitworking.org/news/The_URI_of_a_Weblog_Entry

Posted by Joe at

I am mostly bald, but the few hairs I have left are pointing in every which direction from the furious handwaving going on here.  Dave Winer this morning further confuses me by saying the main application of guids is backup/restore.  It's clear that there's something here that people need, but crystal clear that it's not adequately explained.

I am now going over to the Wiki and creating a new page which will be called PostIdSpec, which will try to force people to converge on the necessary minimum of specification, which includes:

1. what syntactic constraints are placed on this thing

2. if/when producers of Pie/Echo are required to include it in an item

3. what consumers of Pie/Echo are required or recommended to do with it.

If you have consensus on these things, then you have a viable specification and we can move on.  If not, not.  Check out the Wiki in about an hour.

Posted by Tim Bray at

Permalinks? No, URIs, URLs, and identifiers.

On Sam Ruby's wiki, which describes what may soon be called Echo, and a thread on Sam's site, there is currently a thorough discussion of "permalinks" and "identifiers". The problem is: there isn't a clear definition of what a "permalink" is. Some...

Excerpt from Ken MacLeod at

Sam: If we only need two, and <link> and <guid isPermalink="false"> can already fulfill that role, why add new elements for this? I'm afraid it will just give users more ways of entering the same data, which will just add to the confusion.

As an aggregator writer, I will still have to support the currently existing fields, but now also support these new ones. And if they don't match, an aggregator can only guess at which one to use. We're already in this situation right now with <link> vs. <guid> which some feeds use to point to 2 different urls on the blog-website, some use the <link> as the external link and the <guid> as the internal, and some do it the other way around.

Adding more tags is not going to solve this existing issue as the old tags still need to be supported for backwards compatibility reasons, but may add more problems if (when!) they will be used inconsistently.

Better documentation of the existing tags may help though (though I don't agree with Dave's clarification of the link-tag, which goes against the way 95% of all feeds currently use this item)

Posted by Luke Hutteman at

Check out http://www.intertwingly.net/wiki/pie/PostIdSpec

If we can get this filled in, we can move on.

Posted by Tim Bray at

Dare Obasanjo

Luke,
What do you mean by adding already existing tags? What already existing tags? Have you been following the Wiki? This discussion isn't related to adding anything to previously existing efforts from what I gather.

Message from Dare Obasanjo at

Dare: I admit I have not followed the Wiki (this comment thread is taking up quite enough of my time already ;-) but what I gather is that we're talking about a link (that could potentially change) versus a post-id (which is constant and unique).

I don't see why we can't just use <link> and <guid> for this purpose instead of inventing new tags like <postid>.

Posted by Luke Hutteman at

Luke,
My apologies for the brusque tone of my post. Sam's site 404ed on me twice while I was trying to post my original response.

Posted by Dare Obasanjo at

Christian Romney

Dare,
What I meant by multiple representations is the following: Suppose I have an entry stored on some medium. That stored entry is my content. The HTML view of that content is one representation, and would look very different on the wire than the RSS or PIE representation/view of the same content. Each of these representations should have a URI associated with it because the representation itself is unique. For weblogs, these URIs will almost certainly be URLs because the representations/views are network retrievable, but I'm sure there are applications where this need not be the case. The core content itself, though, is a related but different entity that may be bare of any markup at all and should therefore have its own URI associated with it. I think we're on the same page in terms of using a URI for everything, and that we both understand the difference between URIs and URLs. Where I think we have divergence in philosophy is on whether or not a view/representation is a distinct entity that should be referenced or identified via its own unique identifier (URI). Thoughts?

Message from Christian Romney at

Timothy,

The postid is useful to outsiders if it is guaranteed to be unique in the namespace scope. That statement holds regardless of URI or URN, as a URI can always be constructed from a namespace plus a URN. So it really doesn't make a difference. For internal consumption, as long as there is an aggregator/etc. specific mechanism to achieve a 1-1 mapping between the URN and URI, we will be ok.

Posted by Rahul Dave at

Case In Point

Part of the discussion about Echo centered briefly on case, or rather, on camelCase. The problem is not that humpbackedCamelCase is difficult to emit consistently. The problem is case sensitivity....

Excerpt from Cox Crow at

Syndication Reloaded and Revolution

Sam Ruby started a Wiki on The Conceptual Model of a Log Entry. On the issue of proposed new syndication format, which I presume it would be more than just syndication, there’s a Roadmap to New Log Format. The idea......

Excerpt from yowkee essential at

Well, crap.  I totally agreed with the required permalink and required postid until last night.  When I was building an RSS feed for source control change notifications, it became obvious that it was going to be extremely difficult to come up with a reasonable permalink for items that carried any value whatsoever (that is, a permalink related to the specific item, rather than a generic one).

Since we're requiring postid, I don't necessarily see an absolute need to require permalink any more.  Thoughts about this?  I'll take it to the wiki if you prefer, but I got the impression that when you bring a topic back here, this is where the "final" discussion should take place.

Posted by Greg Reinacker at

This would be a lot easier if we stopped calling permalinks that and instead just called them 'links'. A 'link' is a URI which points at the content which a particular syndication entry is syndicating.

An ID is also a URI, but it serves a different purpose. While a link is used to 'visit' the entry in its original context, the ID serves only to differentiate different entries.

The ID being a URI is a convenience, because URIs already provide us with namespaces. In a lot of cases the ID and the link will be the same. In other cases, there could be multiple links, one of which matches the ID.

However, the key point is that, in syndication, the ID is ONLY FOR IDENTIFYING UNIQUENESS. It should never be 'visited'. Thus it doesn't even have to lead to any content in particular: it just exploits the hierarchical namespaces that URIs provide.

I'm thinking along similar lines to how XML namespaces are identified: they are given by URLs which often lead nowhere, but can be considered globally unique because people only create them in their own namespace.

In this case, then, a weblog system might require URLs of the following structure:
http://www.wibblenoo.com/archive_may_2003.html#12344
but the IDs could look like this:
http://www.wibblenoo.com/entries/12344

The structure of this ID is completely opaque, but assuming everyone does it right they will be globally unique. The weblog/CM system, when dealing with syndication, doesn't do anything with these URIs except generate them in such a way that it knows they will be unique for different entries and always identical for the same entry in different contexts.

Of course, I'm only talking from a syndication perspective. In order for this to work for an API also, the unique ID must be in a form from which the CMS can derive what it needs to relate to a specific item within its internal data structures.

To take a real-world example, I'll pick on LiveJournal since I spend far too much of my time with it and know most about it. LiveJournal internally keys each entry on an integer userid and an integer called the itemid. itemids are per-user, so in order to uniquely identify an entry the system needs to know the userid (which is quickly derivable from the username) and the itemid.

LiveJournal uses links in the following format:
http://www.livejournal.com/users/username/12333.html
These are perfectly fine as both 'permalinks' (which I still prefer to call just 'links') and unique IDs.

However, LiveJournal also allows users to enter a DNS domain name which, when given in the HTTP Host header of the request, will make the user's journal load. These URLs look like this:
http://some.random.stuff.com/12333.html

Let's assume for the sake of example that these URLs both point at the same entry. The user has gone to the effort of registering and entering this domain and will want the links in the feed to reflect it. However, LiveJournal knows the entry by the canonical livejournal.com form.

Both of these are the same entry, so they need the same unique ID. However, they both have a different link. In syndication, this means that they will be considered the same if they both end up in the same aggregator. For a generalized protocol, only the unique ID provides what LiveJournal needs to identify the entry internally.

(Technically, of course, LiveJournal could look up the domain and match it to a user, but this carries more overhead than processing the URL which contains the username, and users can change their domain far more easily than they can change their username, thus breaking the unique IDs.)
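
As a rough sketch of the LiveJournal example above: two different links resolve to one entry, and the unique ID is always derived from the canonical form. The lookup table, regular expressions, and function names here are invented for illustration and are not LiveJournal's actual code.

```python
import re

# Hypothetical table standing in for the domain-to-user lookup.
USERNAME_BY_DOMAIN = {"some.random.stuff.com": "username"}

CANONICAL = re.compile(r"^http://www\.livejournal\.com/users/([^/]+)/(\d+)\.html$")
CUSTOM = re.compile(r"^http://([^/]+)/(\d+)\.html$")


def parse_link(url):
    """Return (username, itemid) for either link form."""
    m = CANONICAL.match(url)
    if m:
        return m.group(1), int(m.group(2))
    m = CUSTOM.match(url)
    if not m:
        raise ValueError("unrecognised entry URL: %s" % url)
    # Custom domain: this is the extra lookup mentioned above.
    return USERNAME_BY_DOMAIN[m.group(1)], int(m.group(2))


def unique_id(url):
    """Always build the ID from the canonical form, whatever the link was."""
    username, itemid = parse_link(url)
    return "http://www.livejournal.com/users/%s/%d.html" % (username, itemid)


canonical_link = "http://www.livejournal.com/users/username/12333.html"
custom_link = "http://some.random.stuff.com/12333.html"
same_entry = unique_id(canonical_link) == unique_id(custom_link)
```

Here the feed can carry whichever link the user prefers, while `unique_id` stays stable even if the custom domain later changes.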

Sorry this got a little long and waffly. I edited it down a lot after I wrote it, but it's still not as concise as I'd like.

Posted by Martin Atkins at

Greg, we are not in the final decision phases.  The truth of the matter is that there isn't one medium that works for everybody.  People like Shelley and Tim are doing an excellent job of abstracting up the discussion into more palatable chunks for people who can't manage to keep up with the wiki.

My surfacing of points here is meant to serve a similar purpose.  Not to make final decisions, but to increase visibility.

Now, as to your specific question, what would be the problem with making the perma-link and post-id the same in this instance?  What I don't want is aggregators to have to guess or invoke precedence rules, and I do believe that if a little redundancy is necessary in order to make this so, then that is a tradeoff worth making, IMHO.

Posted by Sam Ruby at

Sam, the problem in my particular example was that I can build a postid (which would be some kind of non-resolvable URN), but I can't build a reasonable URL for a permalink.

I suppose I could throw the URN into the permalink field too, but I think permalinks should always be resolvable.

I'm not trying to be difficult - like I said, I was totally on board with requiring both fields until diving further into this particular project.

Posted by Greg Reinacker at

Please do throw the urn into the permalink field.  That way you have unambiguously said "this is non-resolvable".  The recipient doesn't have to guess as to why you didn't include a link field - you clearly have told it.

Looking at the larger picture, there will always be schemes that the recipient does not support.  Most if not all will support http.  Many will support https and ftp.  A few may support irc.  And so on.
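
A minimal sketch of how a recipient might tell the two apart by scheme, using Python's standard `urllib.parse`. The set of supported schemes and the example URN are assumptions for illustration; each recipient would have its own list.

```python
from urllib.parse import urlsplit

# Schemes this hypothetical recipient knows how to dereference.
SUPPORTED_SCHEMES = {"http", "https", "ftp"}


def is_resolvable(uri):
    """True if the URI's scheme is one this recipient can fetch."""
    return urlsplit(uri).scheme.lower() in SUPPORTED_SCHEMES


resolvable = is_resolvable("http://www.intertwingly.net/blog/1492.html")
not_resolvable = is_resolvable("urn:example:scc:change:1234")
```

A recipient that gets `False` simply shows the entry without a clickable link; no guessing about a missing field is involved.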

Posted by Sam Ruby at

Well, we could do that...the only problem I see then is that "simple" parsing tools (like a quick XSLT transform) then get a lot more difficult to write, as they have to tell the difference between resolvable URL's and non-resolvable URN's...

Posted by Greg Reinacker at

Two Identifiers

Every Echo entry needs two identifiers, which we'll call, for lack of better names 'post-id' and 'perma-link'. They need to be separate, and they need to be required. There is still a pretty heavy debate going on in the wiki and in Sam's blog about...

Excerpt from BitWorking at

I think there needs to be a second permalink to the blog post data in some relatively straightforward data interchange format like a RSS... [more]

Trackback from the iCite net development blog

at

Shoes For Industry

I've never understood just what the Firesign Theatre was getting at there, but somehow I get it anyway. And again, when Paul wrote that line in Hey Jude about the movement you need being on your shoulder, John had the presence of mind to intimidate...

Excerpt from Steve Gillmor's Emerging Opps at

In brief, anal sex edition

The Supreme Court upholds the right to have anal sex.  Also, some other less important news.... [more]

Trackback from dive into mark

at

I've been running a weblog for less than a year, and I have close to zero experience with syndication. I hand roll my weblog AND my very basic RSS feed (2.0). I am, therefore, speaking from a "doesn't know diddly" point of view.

I'm very interested in this Echo/Pie thang, because it will clear up some of the confusion that RSS causes. I've been following the discussion on links, and I'd like to seek some clarification on something.

I can see a need for a maximum of 3 kinds of identifier:

1. Persistent URL (a.k.a. permalink)
2. Temporary URL (front page of blog)
3. Some other ID (for content management systems).

I think that (1.) should be REQUIRED, and (2.) and (3.) should be OPTIONAL. All three should be standardized in some way, if at all possible.

Does this fit in with what everyone else is thinking?

Also, I am not a big fan of camelCase. I prefer to see alllowercase or all-lowercase-with-hyphens.

Posted by Simon Jessey at

I prefer underscores... some languages (Python) can't handle variable names with hyphens, but virtually all support underscores.

post_id looks nicer than post-id anyway. Ok, art is in the eye...

If you don't conform to the demands of Python Programmers Everywhere the Supreme Court will be a knockin'.

Posted by Mike Watkins at

Christian Romney

Simon,
"2. Temporary URL (front page of blog)" is completely unnecessary.
Two things in a feed can have a permalink. The feed itself and each item/entry in the feed.
Use permalink of the feed to specify 2.
Use permalink of the item to specify 1.
3 is the cause for debate, but most people seem to agree it is needed. Dare is a notable exception.
ex.
<feed><permalink>http://[front page of blog]</permalink><item><permalink>http://[permalink for entry]</permalink></item></feed>

PS this example doesn't conform to any particular format (RSS or Echo) just to clarify for you.

Message from Christian Romney at

Thanks, Christian. That makes a lot of sense to me. Personally, I use just the persistent URL. An example from my RSS feed (note the redundancy):

<item>
<title>An alternative to RSS?</title>
<description>Sam Ruby has initiated a project to create an alternative to the
mess that is RSS.</description>
<link>http://jessey.net/blog/2003/jun/index.html#e25a</link>
<guid isPermaLink="true">http://jessey.net/blog/2003/jun/index.html#e25a</guid>
<pubDate>Wed, 25 June 2003 00:00:00 GMT</pubDate>
</item>

Even though it is simple, I still think it is messy.
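
For illustration, this is roughly the precedence logic a consumer ends up writing for an item like the one above. It is a sketch using Python's standard `xml.etree.ElementTree`, with a trimmed copy of the item; the `identifiers` function and its fallback order are one possible choice, not something any spec mandates. The only rule taken from RSS 2.0 is that a missing `isPermaLink` attribute defaults to true.

```python
import xml.etree.ElementTree as ET

ITEM = """<item>
<title>An alternative to RSS?</title>
<link>http://jessey.net/blog/2003/jun/index.html#e25a</link>
<guid isPermaLink="true">http://jessey.net/blog/2003/jun/index.html#e25a</guid>
</item>"""


def identifiers(item_xml):
    """Return (link, unique_id) for an RSS 2.0 item, guessing where needed."""
    item = ET.fromstring(item_xml)
    link = item.findtext("link")
    guid = item.find("guid")
    if guid is None:
        return link, link          # no guid: fall back on the link
    guid_text = (guid.text or "").strip()
    # RSS 2.0 treats a missing isPermaLink attribute as "true".
    if guid.get("isPermaLink", "true") == "true" and link is None:
        link = guid_text           # the guid doubles as the link
    return link, guid_text


link, unique_id = identifiers(ITEM)
```

With both fields always present and always meaning the same thing, the two fallback branches would disappear.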

Posted by Simon Jessey at

Simon, check out EntryIdentifier on the wiki.  The attributes named in the main section are a little wordier than necessary, but they're meant to be specific.  Existing terms can be reused, as long as their definitions are clear (see "Definitions -- Part Deux" at the bottom).

Your (1) is echo-identifier, (2) is echo-also-at, which can also be used for categories, and (3) is echo-publisher-identifier.

echo-identifier is comparable to "permalink", but I believe echo-identifier has a more concrete definition, as pointed out at the top of that page.  echo-publisher-identifier is (almost?) the same as "post-id".

Posted by Ken MacLeod at

I just noticed echo-location, a URI location on the wiki. Are we adding a <dolphin>...</dolphin> element then?

Thanks for the pointer, Ken. It is all much clearer to me now.

Posted by Simon Jessey at

Late to the party

One of the problems with using an RSS Aggregator is that it gives the illusion of allowing you to keep up with a practically infinite number of weblogs. Whenever you find a link to a new weblog with an interesting entry, the temptation is high to... [more]

Trackback from public virtual MemoryStream

at

CYBARBER

On RSS postId/Link/Guid, or a
reminder of Tantek's exposé on Bed and Breakfast markup and Anorexic Anchors,
Nov 2002 (http://tantek.com/log/2002/11.html#L20021128t1352)

My idea: the RSS spec needs:
- for a channel element, a bookmark link to the default web(log) page, and
- for the item element, one bookmark element pointing to the item's so-called permalink, which will mostly be the permanently archived version of the item shown on a separate web page

Bad example from Dave Winer's page:
(almost plain text with some Bed and Breakfast markup; there is a link at the top of the item, with the named anchor it refers to just in front of it.)

"

Post IDs<a name="postIds">&nbsp;</a><a href="http://scriptingnews.userland.com/2003/06/25#postIds" title="Permanent link to 'Post IDs' in archive."><img src="http://www.scripting.com/images/leftArrow.gif" height="9" width="11" border="0"></a>

FWIW, in RSS 2.0, I thought there should be a core-level post ID element, but I thought there was a pretty good chance, based on experience with the Blogger API, that each tool would have a different way of expressing it.

The compelling app for post ID's is backup and restore. If I'm using RSS to back up a weblog, and if I need to do a restore, the post ID's must be preserved, or when I regenerate the site after a restore, permalinks will break. Also since......
"

Remark on the side:
- Why use both the anchor and the link? Like this, you might as well delete the whole anchor and just make the link its own anchor, like:

  <a id="postIds" name="postIds" rel="bookmark" rev="bookmark"  href="http://scriptingnews.userland.com/2003/06/25#postIds" title="Permanent link to 'Post IDs' in archive."><img.....></a>

In the RSS feed, one element such as <bookmark>http://scriptingnews.userland.com/2003/06/25#postIds</bookmark> (or <guid/>, or <link/>, or whatever word is used) is then enough to navigate to the bookmark.

Better but not ideal example from this 1492 blog:
(compared to the unsophisticated scriptingnews example this one is a lot better, as the link is at the END of the item and the bookmark is (almost) at the START of the item. Here again an empty anchor is used, which is not sophisticated: see Anorexic Anchors)

"
<div class="comment">
<a name="c1056647974"></a>

Please do throw the urn into the permalink field.&nbsp; That way you have unambiguously said "this is non-resolvable".&nbsp; The recipient doesn't have to guess as to why you didn't include a link field - you&nbsp; clearly have told it.

Looking at the larger picture, there will always be schemes that the recipient does not support.&nbsp; Most if not all will support http.&nbsp; Many will support https and ftp.&nbsp; A few may support irc.&nbsp; And so on.

Posted&nbsp;by <a href="http://www.intertwingly.net/blog/" title="rdu57-27-066.nc.rr.com">Sam Ruby</a> at
13:19
</div>
"

Remarks:
Having the link at the END and the anchor at the START of the item is useful when viewing the default blog page: when there are several items on the page and/or the item has a lot of content (the archived version entered from RSS), clicking the link will bring you back to the start of the item without any need to scroll (not so in the scriptingnews example).
However, instead of the anchor with name=..., one should ideally use an ID on the class="comment" container DIV for the item, like:

instead of:
<div class="comment">
<a name="c1056647974"></a>

Please do  ...

This would be better:
<div class="comment" id="c1056647974">

Please do .....

<a rel="bookmark" title="Permalink to c1056647974 in archive" href="http://www.intertwingly.net/blog/1492.html#c1056647974">13:19</a>

Cybarber

Posted by Cybarber at

idiots....

Excerpt from moedusa at

Two Identifiers. Every Echo entry needs two identifiers, which we'll call, for lack of better names 'post-id' and 'perma-link'. They need to be separate, and they need to be required. There is still a pretty heavy debate going on in the wiki and in...

Excerpt from André Venter: Dev at

First I have to say that I hate the word "permalink". Second, I think we need to decide the following before we settle the post-id vs permalink issue:

  1. Should an Echo feed be "allowed" offline and off the web? E.g., can an Echo feed be sent as an email attachment without losing any integrity or context?

  2. Should we make use of the benefit that lies in XML-IDs, or should we drop it?

  3. Should an Echo entry be available for external reference?

My answers are:

  1. Yes, an Echo feed should be allowed to exist without a persistent URL associated with it. It doesn't have to be located. If it is to be located, there's no point in having its associated link (e.g. "permalink") as a URI. It should of course be a URL. I can't see any advantage in having either the permalink or the id as a URI if it isn't going to be retrieved, especially since the URI syntax makes it impossible to use as an XML-ID.

  2. Yes, I think so. The post-id should therefore start with a letter [a-z] and then contain nothing but letters [a-z] or numbers [0-9]. I also think the post-id should be globally unique, in the same manner as the MessageID of email and USENET messages. They would need another syntax than the MessageIDs, though, because the MessageID syntax (e.g. <Xns93ABAD69DCF63asbjorntigerstadenno@news.online.no>) can't be used as an XML-ID.

  The point here is to gather enough information about the posting system, the author, the origin URL etc., to uniquely identify a post, and then hash this information into a syntax that can be used as an XML-ID. The discussion on how to uniquely identify an author would help a lot in this manner.

  After the post-id is created for an entry, it should never change. It should be interchangeable, and it should follow the entry wherever it is aggregated, displayed or used. Having a unique ID and not only a permalink gives other systems the possibility to also uniquely identify an entry. This isn't only a nice feature, I think it is a must-have.

  3. Yes, of course. But it's also possible it can't be resolved, because the original entry lies in the "My Documents" folder on John Doe's computer. Therefore, a URL should (but not must) be provided for an entry, and the main URL to provide is of course a URL to the location of the original Echo feed. Whether the real source of the feed is in a MySQL database doesn't matter. It's the first instance of the Echo feed from the originating system that should be identified as the original feed (or entry).
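
The hashing idea in answer 2 could be sketched as follows. This is only an illustration: the choice of SHA-1, the input fields, the `e` prefix, and the function name `make_post_id` are all assumptions, not part of any proposal on the wiki. The regular expression encodes the letter-then-letters-or-digits syntax suggested above.

```python
import hashlib
import re

XML_ID = re.compile(r"^[a-z][a-z0-9]*$")  # the syntax proposed in answer 2


def make_post_id(system, author, origin_url, created):
    """Hash identifying details into an id of lowercase letters and digits."""
    material = "\n".join([system, author, origin_url, created])
    digest = hashlib.sha1(material.encode("utf-8")).hexdigest()
    # A hex digest may start with a digit, so prefix a letter.
    return "e" + digest


# Hypothetical inputs describing one entry.
post_id = make_post_id(
    "HandRolled 1.0", "John Doe",
    "http://example.com/23212.echo", "2003-06-27T12:00:00Z",
)
```

Because the id is computed once from fixed inputs, it has to be stored with the entry; recomputing it after the origin URL changes would yield a different value.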

But it's also extremely useful to have different views of an entry. Therefore, we should use something like the "References" header in the NNTP protocol. The life cycle of an entry could then be something like this:

  1. John Doe writes an article about lemmings. He (or rather, his system) gives the article a globally unique post-id (or message-id) and also attaches any visible views he has on it. This is of course the Echo view, maybe an HTML view, maybe an RDF view, etc. These views may be referred to via several <link> elements with appropriate "rel"s on them.

  2. Bob Smith consumes this Echo entry in his system. The post-id allows him to uniquely identify it, so whenever John publishes an update on his article, Bob can overwrite (or do some sort of internal version control) the entry in his system.

  3. Bob wants to expose this entry in some of his feeds. He does this in a straightforward manner, only he attaches his views of the article to his version of the Echo entry. This gives the entry a consumer history, and the article can be traced back to its origin because of John's attached views.

  4. Bob has a friend named Mary, who is very interested in lemmings. She sees John's article on Bob's site, and consumes it. She consumes it from Bob's site, and not from John's, as the content is the same and Bob's server is closer to hers. When she consumes it, she adds yet another view on the entry.

I think this is a likely life cycle of an entry. What's important is that the post-id doesn't give you any reason to presume that the entry (or feed) is available on the web, because it doesn't have to be. We have to provide a generic way to attach different views to an entry, and the default view is an Echo XML entry at a given URL.
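
The life cycle above might be modelled like this. It is a rough sketch only: the `Entry` and `Store` classes, the `(rel, url)` pair representation of views, and all example values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    post_id: str                               # never changes once assigned
    content: str
    views: list = field(default_factory=list)  # (rel, url) pairs, oldest first


class Store:
    """A consumer's local store, keyed on post-id."""

    def __init__(self):
        self.entries = {}

    def consume(self, entry):
        # Same post-id means same entry, so an update replaces the stored copy.
        self.entries[entry.post_id] = Entry(
            entry.post_id, entry.content, list(entry.views)
        )

    def republish(self, post_id, rel, url):
        """Return a copy of the entry with this system's own view appended."""
        e = self.entries[post_id]
        return Entry(e.post_id, e.content, e.views + [(rel, url)])


john = Entry("e3f1lemmings", "An article about lemmings",
             [("echo", "http://john.example/lemmings.echo")])
bob = Store()
bob.consume(john)
from_bob = bob.republish(john.post_id, "echo", "http://bob.example/lemmings.echo")
mary = Store()
mary.consume(from_bob)
```

Mary's copy still lists John's view first, so the entry can be traced back to its origin even though she never contacted John's server.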

My proposal for the "what if my site changes" issue is that an entry shouldn't be locked onto a category. As an external reference, the category is of no interest as a part of the URL. If you need to know what category an entry is in, you should provide this in the feed as elements of some sort, not in the URL. The URL should be clean and simple, and however you categorize a feed, its URL (or permalink) should be the same.

Ideally, "http://example.com/music/rock/23212.echo" should retrieve the exact same article as "http://example.com/23212.echo", but as long as we attach every different view of the article to the Echo entry, and describe the main and preferred view properly, it shouldn't matter.

My conclusion is therefore: Describe post-id as a MUST, and permalink (or whatever) as a SHOULD. Other views of the article are a MAY.

Lastly I'll comment on the camelCase issue. CamelCasing doesn't work if the systems are case sensitive. They shouldn't be, of course. All lowercase is less readable, but easier to do right. Underscore is plain ugly. Hyphens are beautiful, and whether a programming language supports them or not has nothing to do with anything. There's no problem calling "post-id" "$postId" or "$post_id" in your favorite language. You can call it "$myMotherSmokesPot" for all I care. It makes no difference.
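
As a small sketch of that last point, hyphenated element names can be mapped to whatever identifiers a language allows at parse time. The element names and the `to_identifier` helper are illustrative assumptions, and the XML is not in any agreed Echo format.

```python
import xml.etree.ElementTree as ET


def to_identifier(element_name):
    """Turn a hyphenated element name into a Python-friendly key."""
    return element_name.replace("-", "_")


entry = ET.fromstring(
    "<entry><post-id>urn:example:1</post-id>"
    "<perma-link>http://example.com/1</perma-link></entry>"
)
# Iterating an element yields its direct children.
fields = {to_identifier(child.tag): child.text for child in entry}
```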

Posted by Asbjørn Ulsberg at

Add your comment