ContentDiscussion - Atom Wiki

[Referring to the refactoring that produced the first draft of the version now in content.]

[AsbjornUlsberg, RefactorOk] This actually looks neat. I have to digest the examples for a while, but at first glance it looks quite complete and thought through. The use of "src" versus "href" in the <content> element may have some benefits as well. Great work.

Content modules poll and discussion

Place your name under the option you like best: [OpenPoll]

Zero or more content modules (0)
Exactly one content module that cannot be null (0)
Exactly one content module that can be null (8)

TimBray, DaveWarnock, JeremyGray, ChrisWilper, SteveKirks, KenMacLeod, SjoerdVisscher, AsbjornUlsberg

Content is optional, and if present uses or is derived from MIME (0)
One or more content modules that cannot be null (2)

TimothyAppnel, DiegoDoval

One or more content modules that are each derived from MIME (1)

HaroldGilchrist

Content, title, description each have MIME types and optional encoding, at least one of the three must be present (1)

MishaDynin

'Null' means present but empty, see NullValues.

In a syndicated feed (principal use of RSS), what goes into a required, non-empty/null "content" for sites that don't distribute content? BiblioGraphy seems to already cover title and summary. RSS' vagueness about <description> goes away: either it's echo:summary (an abstract or excerpt) or it's echo:content. content:encoded and xhtml:body are echo:content. In each of these cases, echo:content appears to be optional/empty/nullable in the case of a syndicated feed that does not syndicate content.

[TimothyAppnel] So, content should be optional since there are clear cases where a syndication feed will only contain metadata?
[DiegoDoval] Based on the 'bibliography' usage, I'm willing to move my vote to "One or more content modules that can be null". But regarding multiple content elements, the current EchoExample or the July 1 snapshot of the example shows why having multiple content elements is useful.

See ContentDiscussion, ContentProblems, Fatal Flaw, Making encoding explicit, Meta Content Format, MimeContent, EscapedHtmlDiscussion, Escaped HTML discussion, EchoExample, ComponentBlog, NullValues, MultipleContentDiscussion, and SiteAndSyndication.

see AlternativeRepresentation: locations of alternative representations of the entry

[TimBray RefactorOk] I see no good reason for having multiple <content> children of <entry>. We need a compelling use-case or something very useful that you can prove you can't do before we impose this additional level of complexity on software authors.

+1 [TimothyAppnel]
+1 [DaveWarnock]
+1 [KenMacLeod] I presented multiple content items in the proposal below based on discussion, but I favor a single, optional content element.
+1 [ChrisWilper] If multiple items need to be given in one entry, either 1) use multiple entries or 2) include each of those items in the content itself (btw I'm not assuming all content is html-like)
+1 [DanBri] Done properly, we'd need to re-invent HTTP ContentNegotiation to handle multicontent. More motivation needed.
-0.3 [RichardTallent, RefactorOk] Lack of multiple content items devolves into unnatural reliance on HTML, XHTML, SMIL, etc. to "package" the various related bits. Lack of essential complexity results in ugly hacks.

See references to multipart (and multipart/alternative) below. In some way, we would need to define what it means to have multiple content items, and it seems MIME has already done exactly that work.

-0.3 [RichardTallent, RefactorOk] Compelling use-case: publishing photographs in sets for critique. Even in an entry of a set with a single image, multiple sizes may be desired (thumbnail, larger, original) or versions.

What technique would you use to tell the reader which version to select to display? Multipart would handle the multiple content, and already provides some mechanisms to select between them, but whatever technique would work with multiple Entry content items would also work with a single content item containing a multipart.

-0.3 [RichardTallent, RefactorOk] "Related" should be a type of content item: comments, replies, translations, alternate stylesheets, bibliographic references, exam Q&A for distance learning applications, multiple format choices for multimedia feeds, etc. (Sorry for three bullets, but these are three distinct ideas.)

Please expand on this below under "Optional content" and "Extensions go in content". Specifically, what is the "content type" for those types of content?

HaroldGilchrist

RefactorOk

[KenMacLeod, RefactorOk] Harold, if I understand you correctly, my answer is that there is a separation between content and information about content (metadata). In BiblioGraphy, echo:title and echo:subject are metadata, things about content somewhere. On this page, echo:content is the "thing" (the body of a post or a picture, for example). In RSS, <description> used to mean a short summary of a post, but often now is used to include the full body of a post. content:encoded and xhtml:body are used almost exclusively for the full body of a post. In Echo, as a proposal, summaries are echo:summary and are metadata (not content), full content is the <content> element. Please let me know if I understood correctly.

[HaroldGilchrist, RefactorOk] What I am saying above is don't give us fewer options to describe content than we already enjoy today in RSS. The ECHO definition of content has narrowed so much that I don't see how we can call it content. It should be called something like "content encoded". Others and I have expressed that we believe there also should exist a second type of content (and most likely others) in a feed, named "content by reference". This is not a new invention in RSS, as I have explained above (it is available by using enclosures today in RSS). Putting file hrefs in chunks of html is not close to the same. That is as wrong as saying adding a BiblioGraphy module is the same as just adding the metadata title and summary straight into the content. To get past this, I propose that we split content into two: "content encoded" and "content by reference" (these names could change but the principle should remain the same). Two tags that are children of entry and that can coexist, just as description and enclosure can today in RSS under item. Making the split would give equal weight to each type of content and would match what is offered by enclosure in RSS today. I don't think anyone would argue that this split adds any additional levels of complexity for software authors. At a minimum it simplifies the identifying of file types in feeds. Notice also that the complication of "unlimited content", especially as it applied to "content encoded", has been removed.

Vote for the splitting of content into two here - One "content encoded" module that can be null and one "content by reference" module that can be null:

I agree completely. I think the approach you're describing is the same that several people are proposing, that this is handled by "profiles". There would be a profile that was just about metadata (for those that don't want to syndicate content, for example) and a profile for syndication (for those who think aggregators are weblog browsers). In that case (in any case, likely), the relation to a weblog posting on a site is not "content" as described here, it's a relation of "this" entry record to its URI elsewhere on the web (the <link>). This content page is only about included content (either embedded or src=URI) for use in site maintenance APIs and syndication.

[MartinAtkins] Do included and referenced content really need to use a different element? Can we not have an element which has an attribute referring to the content, and when this is not present the content is assumed to be the content part of the element? I suppose this is against the current trend of putting URIs in element content rather than attributes, but there isn't really any difference in meaning between referenced content and included content: the only way they differ is how the content is obtained.
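The single-element idea can be sketched roughly as follows. This is only an illustration, not a proposal for the format: it assumes a `src` attribute on `<content>` (the attribute name mentioned at the top of this page), and the `resolve_content` helper and the example.org URL are hypothetical.

```python
import xml.etree.ElementTree as ET


def resolve_content(content_el):
    """Return ("reference", uri) or ("inline", text) for one <content> element.

    Assumption: a `src` attribute, when present, means the content lives
    elsewhere; otherwise the element's own text is the content.
    """
    src = content_el.get("src")
    if src is not None:
        return ("reference", src)
    return ("inline", content_el.text or "")


# Two hypothetical <content> elements: one included, one referenced.
inline = ET.fromstring('<content type="text/plain">Hello world</content>')
referenced = ET.fromstring(
    '<content type="image/png" src="http://example.org/photo.png"/>'
)

print(resolve_content(inline))
print(resolve_content(referenced))
```

Under this reading the consumer branches only on how the content is obtained, which is the one difference the comment identifies between included and referenced content.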

[HaroldGilchrist, RefactorOk] KenMacLeod and I chatted on IRC about ECHO content. Here is a link to the summary of the chat session

[TimothyAppnel] In the early days of this wiki the point was made that an entry is nothing without content. Therefore I am puzzled by the notion of a required content element that can be null. It seems to defeat the purpose and is a contradiction in terms, does it not? I am also still confused about what content really is, and I believe that may be at the root of the endorsement being "off." I believe a description and title are both helpful and necessary and should be metadata. It's been classed as an extension module that is (is not?) content. Fair enough. Why would I make an entry that is author, permalink and data with nothing else? How helpful is that? Can someone clarify what I see as a contradiction in terms and an unclear definition? A use case for a null content module perhaps?

[DaveWarnock RefactorOk] Suppose the entry was an announcement of a video. The entry might just be title, and link as you would not want the bandwidth cost of sending the video out with the feed. Or suppose you were announcing a new release and wanted to distribute using a peer to peer method such as Bit Torrent.
[TimothyAppnel] Fair enough, but is title content? Why is content required if it can be null? The current well formed entry requirements seem inconsistent and expose less than helpful scenarios. Since title is not required and assuming it is not content (as your reply implies) what if a developer decided to just provide a link, date and author? Perfectly legal under the current requirements, but not useful or neighborly to consumers -- I have to download that video to know what it is.
[DaveWarnock RefactorOk] I tend to agree that either title or content can be null but not both. But given the variety of possible uses I think we should do little guessing about what combinations of elements should be permitted. The minimal cost is that someone might send out a useless feed with all elements null. The benefit is flexibility for new ways of using echo.
[KenMacLeod, RefactorOk] Do you equate "optional" with "null"? I see no reason for an empty content element.
[TimothyAppnel] There was an earlier discussion on permalinks and unique identifiers where it was discussed whether one could be both unless labeled not, in which case it would be the other. (I know, confusing, which is why it was struck down.) Consensus was reached that the two should be separate and distinct even if some data redundancy is introduced in some feeds. I recall Sam stating that he believed the least amount of guessing on the client side was best. I see A LOT of guessing in the way content has been specified and don't like any suggestion that reads "both are optional, but one must appear". RSS 0.92+ did that and it sucked. In trying to write a simple aggregator I had to deal with so many exceptions and combinations of what an entry may include that it was maddening. Taking that even further (content can be anything) will suck even more for people trying to develop apps that use these feeds and do not fail their users. I'm all for new ways, but not at the expense of consistency and baseline reliability.

[MartinAtkins] With entries sporting MIME types and multiple alternative versions of content, desktop aggregators are going to have to match the functionality of web browsers and HTML email clients in picking out the MIME type they can handle best and rendering it either internally or using some kind of plugin. Since most aggregators are based on web servers bound to localhost, I suspect entries which don't have a text/plain or text/html version would simply get output as an <object> element in the hope that the browser would handle it sensibly. How it'd decide which of the unsupported MIME types is best to write out as an <object> is an implementation detail, of course. I guess the other alternatives could just be linked.

[KenMacLeod] I believe between BiblioGraphy saying the metadata is "summary" and us here saying that what RSS calls "description", content:encoded, or xhtml:body, goes in "content" makes that unambiguous.

[MartinAtkins] It'd be nice to be able to round-trip from LiveJournal to necho and back again with no loss of data, which means we need HTML subjects (or 'titles' as necho calls them). This implies the need for a type on the subject, although perhaps we can define a reasonable default for the sake of avoiding output bloat? (The reasoning is that all LiveJournal-based sites which do syndication should be able to syndicate their entries between each other, thus creating the illusion that the users are present across sites.)

Optional Content

[KenMacLeod] The use-case for optional (and possibly a required but allowed to be empty) content is that an entry may consist entirely of its metadata and no "body".

Extensions go in Content

[KenMacLeod] The alternate case is that much of what we're calling "metadata" should be declared as "content" and placed in the content element, much like SOAP envelopes do with document message bodies. If so, that will change the model of content and require, possibly, some other way of indicating content type, including determining the intent of the "content body" solely by its namespace within a content type of "application/xml" or the use of URI media types.

Possibly Stupid Question

[JonathanSmith] Do any of the above choices allow a web site to distinguish between high bandwidth and low bandwidth, and, if so, which is the best choice?

KenMacLeod

high bandwidth

low bandwidth

External Content Length

Should content have an advisory length attribute for referenced content? [OpenPoll]

Yes:

LeonardoHerrera

No: KenMacLeod, TimBray, JeremyGray, JamesAylett, AsbjornUlsberg

What is the need for 'length'? Can't that be determined in almost all cases by querying the resource (not usually even necessarily retrieving the resource)? Of type, language, encoding, and length of a referenced resource, isn't the 'length' the most likely attribute to be incorrect over time, making it just a guess? By comparison, a good argument can be made against allowing type, language, encoding, and then length too, when using src.

[TimBray] We are not here to invent cool new stuff that might be useful.

[HaroldGilchrist, RefactorOk] If this data is in the feed, a determination based on file size could be used by remote process to determine if file is to be downloaded. Also, most news readers today use only the information in the feed for viewing by the user (of course this could change with the addition of referenced content). The information would be used by the user to assist in determining if they download the file. I guess we could probe the file (I don't know how reliable this is) with another call but does the size of the file cost us that much (since we already probably have it) to have it in the feed? This attribute could be optional.

[SteveKirks, RefactorOk] Agree with Harold above. Handheld devices, especially cell phones, could make determinations on what content to download based on user prefs. Use of the file size "length" above would permit this.

[LeonardoHerrera, RefactorOk] I support this stuff. I don't see many applications using it, but it can be useful in the handheld examples mentioned above. Not a big deal, it's pretty ignorable. My only observations are a) make it optional, and b) clearly state that this attribute is an approximation of the actual file size, not a definitive value to rely on. This way, handheld apps still can use this, and we avoid any security/reliability risks. (Here's a somewhat related thought: what about CRC?)

[JamesAylett, RefactorOk] This is metadata about the representation of the linked content. Given that this representation may change completely independently of the referencing document (the Atom feed in this case), putting it in the referencing document is dangerous. It's like putting an advisory type attribute on a link to a URI you don't control; what is the user agent supposed to do when it completely mismatches? HTTP has HEAD to allow a user agent to get the metadata if needed, and other protocols' lack of similar support is the problem for the protocol to solve, not us. (Which is pretty much what TimBray said above, and what the opening paragraph of this section says. But maybe it's clearer, I don't know.)
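As a rough illustration of the HEAD approach, the sketch below uses Python's standard `urllib.request`. The helper names (`build_head_request`, `advertised_length`, `fetch_length`) are made up for this example, and whether a given server answers HEAD with a usable Content-Length is not guaranteed.

```python
import urllib.request


def build_head_request(url):
    """Build (but do not send) an HTTP HEAD request for `url`."""
    return urllib.request.Request(url, method="HEAD")


def advertised_length(headers):
    """Return Content-Length as an int, or None if absent or malformed."""
    value = headers.get("Content-Length")
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def fetch_length(url, timeout=10):
    """Send the HEAD request and report the length the server advertises.

    Only headers are transferred; the body of the resource is not fetched.
    """
    request = build_head_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return advertised_length(response.headers)
```

A client that cares about size would call `fetch_length` just before deciding to download, so the value reflects the resource as it is now rather than as it was when the feed was written.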

[MikeWarot, RefactorOK] While it's nice to have length, I don't think length alone is sufficient. I believe you need to have all of these for describing external content:

Content length (in bytes)
Content MD5 checksum (in Hex)
Timestamp of the time the checksum was computed

Other nice to have information:

CRC32 checksum of the data
Offset and length information (so you can pluck the guts out of a static web page, if necessary)
URI pointers to a cache of the information, if the source goes 404, or censored.
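For illustration only, the length, MD5, timestamp and CRC32 fields from the lists above could be computed like this with the Python standard library. The function name and dictionary keys are invented for the sketch; nothing here is part of any agreed format.

```python
import hashlib
import zlib
from datetime import datetime, timezone


def describe_external_content(data):
    """Compute advisory fields for a blob of bytes.

    Key names are hypothetical; they simply mirror the list above.
    """
    return {
        "length": len(data),                                    # bytes
        "md5": hashlib.md5(data).hexdigest(),                   # hex digest
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),  # 8 hex digits
        "computed_at": datetime.now(timezone.utc).isoformat(),  # when hashed
    }


info = describe_external_content(b"hello")
print(info["length"], info["md5"])
```

The timestamp matters because the checksum describes the resource only as of the moment it was computed, which is the staleness concern raised earlier in this section.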

MIME and URI media types

Would the support of Mapping between URIs and Internet Media Types make it easier to define content types in support of hierarchical relationships, internal "plain text markup" schemes, or other extensible content types?

Multiple top-level content elements

[DonPark DeleteOk RefactorOk] Isn't order of appearance enough of a hint? BTW, +1 on Ken's proposal to add 'encoding' attribute to 'content' element. For maximum flexibility, we could introduce multi-stage transform like XML-DSig but that is overkill for ((Echo)). This is weird, I wrote this in response to questions about how to figure out author preferred content type among multiple content types in the feed. On another page, my entries got deleted outright. Zeesh.

MoinMoin seems to have a race condition that shows itself during periods of high editing. It tries to give notice of changes, but fails to see simultaneous changes.

[HaroldGilchrist, RefactorOk] Do we need an optional "primary" content attribute for content? With the great possibility of having more than one referenced content type per entry, the primary content attribute could designate the content that is the central content to the entry message. Example: One thumbnail image, one larger image file of the same subject. If we designated the thumbnail as the primary content, the viewer could display the thumbnail and include links for the other larger image and any other referenced content.

[KenMacLeod, RefactorOk] My preference is for one content item only, which may contain a content type of multipart/alternative. I'm not aware of any precedence rules or preference parameters. Towards the end of the definition above, it states that multiple content elements within an entry are to be treated as multipart/alternative.

[HaroldGilchrist, RefactorOk] "My preference is for one content item only". I see that in the open poll. What is your argument here against multiple content?

KenMacLeod

RefactorOk

WellFormedEntry

should

treated

that

may

[HaroldGilchrist, RefactorOk] Complicated clarification wears many faces in this situation, complicated syntax only being one of them. My concern is mainly directed at ease of deployment and uniform adoption by ECHO feed software.

If the situation is that type multipart/alternative and multipart/mixed will be a requirement of ECHO 1.0 then the ECHO feed software vendors will all support its use with their best effort. But if it is left optional in ECHO 1.0 and treated just like other optional content mime types (which in a pure media type sense it isn't) its best effort support is questionable. If left optional is the final decision, I would favor allowing repeatable content to address my stated concerns.

[SteveKirks, RefactorOK] Ken and Harold, with regard to multiple content items, I give handheld devices like cell phones as the example. The reader on the phone could intelligently determine which image to download based on the feed's content.

[HaroldGilchrist, RefactorOk] I guess we could use "order of appearance" for precedence rules.

[KenMacLeod] Checking the specs re. multipart/alternative, RFC2046 says:

"Systems should recognize that the content of the various parts are interchangeable. Systems should choose the "best" type based on the local environment and references, in some cases even through user interaction. As with "multipart/mixed", the order of body parts is significant. In this case, the alternatives appear in an order of increasing faithfulness to the original content. In general, the best choice is the LAST part of a type supported by the recipient system's local environment.

[HaroldGilchrist, RefactorOk] This would seem to suggest (even though Freed and Borenstein probably were thinking variation meant different text types, not multi-media), if I have audio content and text content in (just using different medium) content of type multipart/alternative with the audio appearing last, the recipient system should understand that the entry prefers to be offered as audio.

If we use "multipart/mixed" for this example, the rule on "faithfulness to the original content" goes away and the order of appearance is still important but could have a different meaning defined elsewhere and not by the spec.
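The "last supported part wins" rule quoted from RFC2046 can be sketched as below. The `(mime_type, payload)` tuple layout and the `choose_alternative` name are assumptions made for this example, not anything defined by Echo or by the RFC.

```python
def choose_alternative(parts, supported_types):
    """Pick the LAST part whose type the recipient supports.

    `parts` is a list of (mime_type, payload) tuples in the sender's order,
    i.e. increasing faithfulness to the original content.
    Returns None when no part is supported.
    """
    for mime_type, payload in reversed(parts):
        if mime_type in supported_types:
            return (mime_type, payload)
    return None


# Hypothetical entry offering the same interview as text and as audio.
parts = [
    ("text/plain", "Transcript of the interview"),
    ("text/html", "<p>Transcript of the interview</p>"),
    ("audio/mpeg", "http://example.org/interview.mp3"),
]

print(choose_alternative(parts, {"text/plain", "text/html"}))
print(choose_alternative(parts, {"text/plain", "text/html", "audio/mpeg"}))
```

A text-only reader ends up with the HTML part, while a reader that also handles audio gets the audio part, matching the reading that placing audio last expresses the sender's preference for it.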

issue: dangers of html

HTML is often viewed as a form of content, but in reality it mixes content aspects with presentation aspects and perhaps even a bit of running code. This can pose a problem unless the recipient is very careful to filter out the undesirable bits. Such filtering poses a number of pragmatic implementation issues given the loose syntax rules for HTML and inconsistent implementation. Ensuring that such content is well formed (with characters properly escaped, tags perfectly nested and closed) eases these implementation issues.

Still, HTML is by far the most popular format for entries with most being written in it and nearly all being displayed in it. And much of the HTML that people write is not well formed. For instance, it is very common to find a naked & in URLs. It's the CGI standard, but to be correct HTML it is supposed to be escaped.
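As a small illustration of the naked-ampersand problem, the sketch below escapes any `&` that does not already begin a character or entity reference. It is a heuristic only, and the `escape_bare_ampersands` name is invented here; a query parameter that happens to look like an entity reference (for example `&copy;=1`) would be left untouched.

```python
import re

# An ampersand that does NOT start a decimal, hex, or named reference.
BARE_AMP = re.compile(r"&(?!#[0-9]+;|#x[0-9A-Fa-f]+;|[A-Za-z][A-Za-z0-9]*;)")


def escape_bare_ampersands(text):
    """Replace unescaped '&' with '&amp;', leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


print(escape_bare_ampersands('<a href="/search.cgi?x=1&y=2">results</a>'))
```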

Another danger of HTML, in any form, is entities. Things like &nbsp; need to be declared, otherwise you end up with non-well-formed XML. So the choices seem to be either supply a DTD, restrict the HTML used so that it doesn't contain any entities besides the base ones given in XML, or stuff it in a CDATA section.

[SamRuby, RefactorOk] I do just fine with the &#dddd; syntax. See clean.

AsbjornUlsberg

RefactorOk

[JoeGregorio, RefactorOk] Raises a good question, is HTML without the entities really HTML? I did a little digging and was surprised to learn how many entities are defined for HTML. http://www.w3.org/TR/REC-html40/sgml/dtd.html

[TomasJogin, RefactorOk] Languages -- other than English -- often use entities for characters not in UTF-8, like &auml; (ä) and &aring; (å). So, no.

[BillHumphries, RefactorOk] This becomes a headache quickly. It'd seem that the format would want to avoid Namespaces and [EntityDefinitions] so that it could be parsed by non-validating, non-namespace aware parsers -- of which, everyone's bound to have one lying around. However, XHTML seems to be the right format for the text media type, as I'd think any subset would be restrictive.

[MikeDavies, RefactorOk] Is restrictive a problem? (Keeping in mind the aim of a minimum specification - using the full HTML syntax could be optional, but mandate at least to provide a simplified set of elements and entities if the html media-type for content is used).

[JonathanSmith, RefactorOk DeleteOk] More discussion off wiki about entities. DonPark writes:

Don Park's Daily Habit

2003-07-29 SamRuby comments:

DonPark

High performance Atom processors can ignore the DOCTYPE and unravel them using an internal character entity map. General XML processors can process Atom files with a validating parser.

Applications using non-validating parsers will run into trouble, but a tool can be written that allows one to preprocess the feed to convert the named character entities to numeric entities.

You might also want to get some input from XHTML WG since they are probably interested in XHTML embedding use-cases like this."
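A preprocessing tool of the kind described in the quote might look roughly like this. It is a sketch that assumes the HTML 4 entity names shipped in Python's `html.entities.name2codepoint` table are the ones of interest; the five entities predefined by XML are left alone, and unknown names are passed through unchanged.

```python
import re
from html.entities import name2codepoint

# These five are predefined by XML itself and need no DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Rewrite named character entities as numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave untouched
        return "&#%d;" % name2codepoint[name]

    return NAMED_REF.sub(replace, text)


print(named_to_numeric("caf&eacute;&nbsp;&amp;&nbsp;bar"))
```

After this step a non-validating parser no longer needs a DTD to resolve the references, since numeric references are always legal in well-formed XML.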

[AsbjornUlsberg] Can't we reach a consensus on this where numeric character entities (&#nnnn;) are preferred (default), and named character entities are legal, but not recommended? A DTD should be provided in the latter case, but won't be needed in the first. I think not having to use a DTD gives people a reason to use numeric entities over named ones.

[JamesAylett RefactorOk] We need to be very clear on the implications (please forgive and correct the following if my terminology isn't quite right). In order to allow named entity references, you must declare them, either in an internal DTD subset or by an external DTD reference. At least the latter requires a validating parser, and the former is quite a burden on the content producer. If you validate, throw away any hope of using extension elements from different namespaces, which in my mind negates a lot of the potential of having this one format to rule them all in the first place. So I agree with Asbjorn: define a DTD in the spec so people can use named entity references if necessary (eg it may make converting legacy feeds easier, where funky extensions are less important anyway), but the norm should really be to use numeric entity references (or a different character encoding), and certainly all consumers need to be able to process this. (NewsMonster falls foul of MovableType and anything else that builds XML by templates and so can end up using undeclared named entities in its RDF feeds; having numeric entity references as the norm should help the ViewSourceClan behave themselves in future.)

[AsbjornUlsberg] Good summarization, James. +1 Should we create a NumericVsNamedEntities page to poll this and reach consensus?

corollary: true content vs. template

The "true content" of an entry is usually the part that gets flushed through a template to appear at the end of a PermaLink. Internally, this is represented in tool-specific syntax. Externally, it is represented in an exchange format (often HTML).

It may be desirable for an external model to be able to link to an externalized representation, ie. without having to either embed it or find it inside of the template at the end of a permalink.

[JoeGregorio, RefactorOk] This format will be used not just for syndication but for publishing also, so it is important to allow full content.

issue: hierarchical relationship between content items

[ShelleyPowers] A weblog entry can also be a parent to other entities, each of which can also contain links, audio, video, etc. which can also be parents to other entities, and so on. See Related. [MarcCanter] This might be where some link up with the ThreadsML effort can happen.

[DannyAyers] But it's not necessarily hierarchical, e.g. a single post can summarise several threads - many parents, one child. Using a tree model would prevent a range of dialogue approaches, e.g. thesis, antithesis -> synthesis. Needs to be a digraph, IMHO, and mainly for this reason (it would complicate Necho syntax) I reckon such relationships shouldn't be in the core.
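To make the tree-versus-digraph point concrete, here is a minimal sketch of a directed graph in which one entry can have several parents. The `EntryGraph` class and the entry names are purely illustrative and say nothing about how Necho would serialize such relations.

```python
from collections import defaultdict


class EntryGraph:
    """Directed graph of entries: each relation links a child to one parent."""

    def __init__(self):
        self.parents = defaultdict(set)   # child  -> entries it responds to
        self.children = defaultdict(set)  # parent -> entries responding to it

    def add_relation(self, child, parent):
        self.parents[child].add(parent)
        self.children[parent].add(child)


graph = EntryGraph()
# One post summarising two threads: many parents, one child.
graph.add_relation("synthesis", "thesis")
graph.add_relation("synthesis", "antithesis")

print(sorted(graph.parents["synthesis"]))
```

A tree would allow only a single parent per entry, so the second `add_relation` call is exactly what a strictly hierarchical model could not express.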

See content, ConceptualModel, and ContentAndPermalink.

CategoryMetadata, CategoryModel