Proposals
A. Escape everything.
-
<content>&lt;em&gt;Ben &amp;amp; Jerry&amp;apos;s&lt;/em&gt;</content>
-
XML parser-based tools receive a string for the entire content
-
supports non-well-formed content, providers do not need to "tidy"
-
literal markup is passed unchanged through intermediaries
-
much easier to parse, clients don't have to convert to XHTML
<content><![CDATA[<em>Ben & Jerry's</em>]]></content>
Votes: AaronSw, BrentSimmons, GarrettRooney, DeveloperDude, BenAdida, LeonardoHerrera
[BrentSimmons, RefactorOk] I don't like in-line: I prefer escaping or CDATA.
Reasons:
1. The chunk of data in between <content> tags really is a chunk of data. I can't think of any earthly reason why the parser wants to deal with it as anything but a single chunk. It may be that you'd want to parse it later, somewhere else in your app, for some reason, but not when you're building an array of weblog entries. You just want the string.
2. It needs to be well-formed, and that's never going to happen. Not now and not in five years. It may happen 99% of the time, as publishing tools get good at it, but anyone parsing this stuff is always going to have to deal with non-well-formed content. It's going to be a huge headache. That's already a problem with RSS, but this will make it worse. (The percentage of non-well-formed feeds will go way up.)
In fact, what I'd probably do is pre-process the feed, wrap inline stuff as CDATA, before passing it to an XML parser. This is a stunningly ugly thing to have to do.
But I'd do it for both reasons #1 and #2: because I want the content as a string, not as a tree, and because I want to deal with non-well-formedness.
[GarrettRooney RefactorOk] Given what BrentSimmons says about in-line being hell for consumers of the data, and the annoyance value of requiring the content to be escaped, it seems like CDATA sections are the lesser evil here. I don't particularly love the idea of wrapping all my content in CDATA, but I like it a lot more than the ideas of reading escaped HTML in one case, or trying to require correct markup and then dealing with the fact that there will always be incorrect markup out there in the other.
[BenAdida] We should learn from the End-to-End design principle of Internet protocols: keep the format simple, don't discriminate between potential uses. Specifically, an Echo reader program's job should be to parse Echo stuff (hopefully assuming very little other than a super-simple Echo XML schema) and pass the payload up to the calling application. The only way to stay simple is to have one implementation method, and the only way to be content-neutral is to quote everything. Anything else means we think we know exactly how Echo will be used from now until the end of time, and that's just ensuring we will miss opportunities to build new, interesting things in the future. We must be prepared for what we haven't yet invented.
[LeonardoHerrera] I want the ability to write a basic non-echo reader without deep thinking. I'm a lazy man; give me a CDATA element, I'll extract the contents and throw them to a display window. Give me encoding and type attributes, and I'm all set: <content type="holograph/animated-no-artifacts" encoding="bork-bork2"> will be ignored by my parser, which puts a placeholder instead (something like "this content cannot be displayed by this little program").
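Brent's point #1 can be seen in a few lines with any off-the-shelf XML parser. This is a minimal sketch using Python's stdlib (the variable names are invented): with escaped content, the payload arrives as one plain string, not a tree.

```python
# Brent's point #1 with an off-the-shelf parser (stdlib sketch, names
# invented): escaped content arrives as one plain string, not a tree.
import xml.etree.ElementTree as ET

feed = "<content>&lt;em&gt;Ben &amp;amp; Jerry&amp;apos;s&lt;/em&gt;</content>"
elem = ET.fromstring(feed)

# The parser unescapes once; the payload is a single chunk of text.
print(elem.text)  # <em>Ben &amp; Jerry&apos;s</em>
```

The consumer gets the HTML back as a string and can hand it to whatever renders HTML, without the XML layer caring whether it is well-formed.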
B. Support inline, escaped, and base64:
-
<content><em>Ben &amp; Jerry&apos;s</em></content>
-
XML parser-based tools parse content in one pass
-
supports potentially non-well-formed content, via an explicit declaration
-
tools can write plain text and XML inline
-
direct support for XSLT, XPath, CSS/XSL, XForms, and InfoPath
<content mode="escaped"><![CDATA[<em>Ben & Jerry's</em>]]></content>
Note: base64 is another content encoding, but it is intended primarily for binary, non-*ML media types.
-
<content mode="base64">PGVtPkJlbiAmYW1wOyBKZXJyeSZhcG9zO3M8L2VtPg==</content>
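As a sanity check, the base64 form above carries exactly the same markup as the other two forms; a quick stdlib sketch (illustration only):

```python
# A quick stdlib check (illustration only) that the base64 form above
# carries the same escaped markup as the other two forms.
import base64

encoded = "PGVtPkJlbiAmYW1wOyBKZXJyeSZhcG9zO3M8L2VtPg=="
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # <em>Ben &amp; Jerry&apos;s</em>
```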
Votes: MarkPilgrim, JoeGregorio, JeremyGray, SamRuby, TimothyAppnel, DareObasanjo, ChrisWilper, ArveBersvendsen, TomasJogin, DiegoDoval, TimBray, KenMacLeod, DaveWarnock, UcheOgbuji, LachlanCannon
[MishaDynin, RefactorOk] This is great for xhtml. What is the default mode for text/plain? "xml" doesn't make sense, and default mode shouldn't depend on type.
-
[SamRuby] Misha, I think your question gets to the heart of why option B is needed. The following two should differ ONLY in mime type:
-
<content type="text/plain">Ben &amp; Jerry's</content>
-
<content type="text/html">Ben &amp; Jerry's</content>
-
[TomasJogin] The "mode" attribute does not reveal the MIME type. "Xml" makes sense as this means that the content is well-formed xml and therefore does not "need" to be escaped. Suggestion: make "mode='xml'" the default and satisfy both the people who want it to be required *and* those who advocate the alternative above.
-
<content type="MIME/type" mode="...">...</content> where the MIME-type given describes the chunk of content after it is un-moded?
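To illustrate how a consumer might dispatch on the proposed mode attribute, here is a hypothetical sketch; the extract_content helper and its defaulting behavior are assumptions for illustration, not part of any proposal text.

```python
# Hypothetical consumer-side dispatch for proposal B's three modes.
# The extract_content helper and its defaulting behavior are assumptions
# for illustration, not part of any proposal text.
import base64
import xml.etree.ElementTree as ET

def extract_content(elem):
    """Return the payload of a <content> element as a markup string."""
    mode = elem.get("mode", "xml")  # inline XML is the default under B
    if mode == "xml":
        # inline: serialize the child elements back into a string
        return "".join(ET.tostring(child, encoding="unicode") for child in elem)
    if mode == "escaped":
        # escaped (entities or CDATA): the XML parser already decoded it once
        return elem.text or ""
    if mode == "base64":
        return base64.b64decode(elem.text or "").decode("utf-8")
    raise ValueError("unknown mode: %s" % mode)

# All three forms of proposal B yield the same payload:
for src in (
    '<content><em>foo</em></content>',
    '<content mode="escaped">&lt;em&gt;foo&lt;/em&gt;</content>',
    '<content mode="base64">PGVtPmZvbzwvZW0+</content>',
):
    print(extract_content(ET.fromstring(src)))  # <em>foo</em> each time
```

Note how the type attribute never enters into the decoding; it only describes what the chunk is after it is un-moded.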
-
[DylanMoreland, RefactorOk] Doesn't the first, unescaped, option signify, per the namespace spec, that em is in the echo namespace? Wouldn't that be true for any XML content stuck in there? Without the namespace I'm not sure what the Echo consumer is supposed to think of XML content. <echo:content xmlns="...">...</echo:content> would establish a default namespace for all enclosed elements, but you would need the prefix and that seems awkward in a document where everything else defaults the namespace (you would also need another namespace declaration). Alternatively, you could mandate that there be only one child of content (such as <body xmlns="...">), and that could contain the declaration and provide a valid container, (see Arve's 3rd example further down the page).
-
[SamRuby] Let's try to keep the three examples in parallel? Either add the namespace to all three, or leave it off in order to focus on the differences. I prefer the latter, but am OK with the former.
-
[DylanMoreland, MichaelBernstein RefactorOk] So we're saying that for XML content there can only be one child element of content? Since we're allowing things other than HTML, namespaces become a big concern and I want to nail this down in whatever specification for which we're voting. Also: which element gets to be this "one child"? The ones that hold meaningful content by default (xhtml:body, svg:svg, m:math, etc)?
-
Although it needs to be made clear what is allowed, there is no technical reason why each of several "top" level elements couldn't have namespaces (a sequence of <xhtml:p xmlns:xhtml="..."> for example).
-
As for which top-level element should be used, the spec should be clear on the intended usage: if <xhtml:body> is used, for example, the reader must take the children of the <xhtml:body> to render in their own page. If <xhtml:div> is used, by comparison, the reader can use the <xhtml:div> as-is. The Atom spec should be very clear on which practice is chosen.
[DareObasanjo] I assume the namespace was left off because this is an example and not because all the markup within the content will be from the Echo namespace. I have updated the example above to use the XHTML namespace.
C. B, but with mode="xml" required
-
JoeMadia, AsbjornUlsberg, BradFitzpatrick, MishaDynin, MSM, MortenFrederiksen
[JoeMadia] I am comfortable with either B or C but would prefer C, since it provides more semantic hinting in the most common case. If we go with B, then it feels to me like we're burying a bit of complexity here instead of making it explicit and obvious to all consumers. Every Echo tool will need to handle all 3 types of escaping, so it's not like either option is making anyone's life any simpler. And it's only 11 characters per item... what a bargain! Does anyone see any downsides to C that I'm missing here?
-
[SamRuby] If there is a name for the mode, I would prefer that it be called "unescaped". The purpose of the mode attribute is to indicate whether the content needs to be decoded once or twice (or as base64). Let's look at a few examples:
-
<content type="text/plain" mode="unescaped">Ben &amp; Jerry's</content>
-
<content type="text/html" mode="escaped">Ben &amp; Jerry&apos;s</content>
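Sam's decode-once-or-twice rule can be sketched as follows. The decoded_payload helper is invented for illustration, and it assumes escaped content carries one extra level of escaping:

```python
# A sketch of the once-vs-twice rule; decoded_payload is invented for
# illustration and assumes escaped content carries one extra level of escaping.
import html
import xml.etree.ElementTree as ET

def decoded_payload(xml_src):
    elem = ET.fromstring(xml_src)
    text = elem.text or ""          # the XML parse is the first decode
    if elem.get("mode") == "escaped":
        text = html.unescape(text)  # escaped mode needs a second decode
    return text

plain = '<content type="text/plain" mode="unescaped">Ben &amp; Jerry\'s</content>'
escaped = '<content type="text/html" mode="escaped">Ben &amp;amp; Jerry&amp;apos;s</content>'
print(decoded_payload(plain))    # Ben & Jerry's
print(decoded_payload(escaped))  # Ben & Jerry's
```

Both arrive at the same characters; only the number of decoding passes differs, which is exactly what the mode attribute signals.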
D. No, I want something else entirely
-
[MSM] Specify MIME-type, particularly for base64 content.
-
[AsbjornUlsberg, RefactorOk] I really hope that "type" was simply forgotten in the example, and that it hasn't been decided that it should be dropped. It's especially needed with base64-encoded content, but also in other cases, like when content refers to external resources, etc.
Considerations
-
People don't read specs carefully, instead they view source and emulate. And when they emulate content that is escaped without a clear signal, they emulate poorly.
-
Huh? What's that supposed to mean? Can you name an actual problem? Everything is escaped; isn't that clear?
-
My _Ben & Jerry's_ example illustrates the problem. I escaped it properly, but it was "corrected". So the answer is: no, it isn't clear to 99.44% of the population. [SamRuby]
-
Well-formed HTML allows us to easily address such issues as relative URLs.
-
How?
-
Requiring consumers to consistently interpret ill-formed HTML correctly is a high barrier to entry.
-
How is this relevant? We're not discussing whether we'll allow ill-formed HTML; we're discussing whether to include HTML inline or quoted.
-
Both escaped and inline should be supported. The best that producers of escaped HTML should hope for is that consumers will sanitize their content (perhaps by removing all tags, as some aggregators do today with titles). A higher bar is possible with well-formed HTML.
-
This is absurd. Many aggregators support malformed HTML; they will continue to with Echo.
Discussions Elsewhere
See also content, this thread, MimeContent.
Should HTML content be escaped (née quoted) or inline?
Discussion
Proposal AaronSw: Everything is escaped.
Commentary Bray: (at greater length below.)
-
I favor signaling, either with a different <content> element or a true/false attribute, whether or not unescaping is required before <content> is dispatched to whatever is going to handle it.
-
But, check out Brent Simmons' remarks & my response below.
-
And, tying the escaping level to the type= is attractive but doesn't quite work
-
And, please, we're talking about escaping not quoting or encoding
[TimBray] Sam and I are converging: I think he's captured the three interesting cases, but I'm not 100% sure that elements are the right level. The following feel a bit more idiomatic to me, but now I'm off to sleep on it:
-
<content><em>foo</em></content>
-
<content mode="escaped">&lt;em&gt;foo&lt;/em&gt;</content>
-
<content mode="base64">PGVtPmZvbzwvZW0+</content>
-
+1 [JoeGregorio]
-
+1 [JeremyGray] I've read this page more than a few times now and keep finding myself back at Tim's suggestion. Keep the 'mode' list _incredibly_ short, recommend CDATA for the 'escaped' mode (even if that causes 'escaped' to be renamed for best clarity), and you'll have my vote.
-
+1 [SamRuby] Close enough. I'm in.
-
+1 [TimothyAppnel] This is only under the condition that the only escaping supported is CDATA, and not entity encoding as the example above illustrates.
-
+1 [JoeMadia] I'm in as is. However, I would prefer that the mode attribute be made required in all cases. The current no-mode case could be replaced with mode="xml" or some other value. This would provide more semantic hinting for developers new to the spec and to developers from the ViewSourceClan.
-
+1 [MishaDynin] Slight preference for making "mode" always required -- but I'm in either way.
-
+1 [DareObasanjo]
-
+1 [ChrisWilper] Because what's going on is clearly evident by the syntax.
-
+1 [ArveBersvendsen]
-
+1 [AsbjornUlsberg] I'm not sure whether I like the word "mode", but it's better than many of the alternatives. I also think this attribute should be required.
-
+1 [MSM] Looks good. With the base64 encoding of binary data (which seems very useful, despite its potential for abuse), won't the consuming application want to know the MIME type of the content? If the consuming application treats the contents of <content></content> as an atomic unit (whether it parses it inline or defers interpretation), wouldn't the MIME-type be useful in an advisory sense anyway?
Withdrawn Proposals
Proposal SamRuby: withdrawn (See this thread) There are three forms of expressing content:
-
Illustrated by example:
-
<content><xml><em>foo</em></xml></content>
-
<content><escaped><em>foo</em></escaped></content>
-
<content><base64>PGVtPmZvbzwvZW0+</base64></content>
-
Rationale:
-
In my experience, people don't read specs carefully, instead they view source and emulate. And when they emulate content that is escaped without a clear signal, they emulate poorly.
-
I'd like to get to the point where the original functionality of the RSS 0.90 link tag can be achieved with the xpath expression "//a/@href" on those feeds that have well formed HTML.
-
If you are a user of a recent version of IE or Mozilla, you already have a validator for well-formedness.
-
Making the signal an element instead of an attribute makes life easier for both tag soup regex based approaches as well as validated schema based approaches.
-
Ultimately, I would like to be able to move on to discussing such things as how relative URLs are to work, and I fundamentally believe that programmatic adjustment of content which is not well formed is an unsafe proposition.
-
+1 [TimBray] Close enough. I'm in. (Seriously guys, we're down to hairsplitting here. Whatever.)
Proposal DareObasanjo: I'd like to withdraw this proposal [DareObasanjo]. There are two forms of escaping:
-
XML (written and read when the type attribute ends in "+xml"): XML markup is kept inline, ala <content type="application/xhtml+xml"><em>foo</em></content>
-
Everything else: markup is escaped once and not indicated, ala <content type="text/html"><![CDATA[<em>foo</em>]]></content>
-
Supporting arguments at http://www.intertwingly.net/blog/1500.html#c1056803387 and http://www.intertwingly.net/blog/1500.html#c1056814829
Proposal KenMacLeod:
-
[KenMacLeod] I'd like to withdraw this proposal. Swapping "escaped" for "literal" (literal becomes the default) is better moving forward, and that makes it identical to Tim's proposal. See Escaped HTML discussion for more discussion about using an element or attribute for signalling encoding, the difference between Tim's and Sam's proposals.
-
Illustrated by example:
-
<content encoding="none">foo</content>
-
<content encoding="literal"><em>foo</em></content>
-
<content encoding="none">&lt;em&gt;foo&lt;/em&gt;</content>
-
<content encoding="base64">PGVtPmZvbzwvZW0+</content>
-
Rationale:
-
content value is character data, unless
-
content encoding is set to 'literal', in which case it may be XML elements and character data, or 'base64', in which case it is encoded using Base64, and
-
the determination of how to present the content is based on the content type.
This differs from DareObasanjo's proposal in that Dare's determines the encoding based on the content type. See TimBray's comment above re. tying escaping to type.
[SjoerdVisscher, RefactorOK] Re. encoding="none" instead of encoding="escaped": The term "escaped" might give the programmer the idea that he has to escape something. But unless you are creating XML without an XML library, you don't have to do anything.
-
In DOM: content.appendChild(document.createCDATASection(htmlString))
-
in XSLT: <content><xsl:value-of select="$htmlString" /></content>
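The same no-hand-escaping point holds in Python's stdlib DOM, for example (a sketch; variable names invented): the library wraps the string in CDATA, so the producer never escapes anything by hand.

```python
# The same point in Python's stdlib DOM (a sketch; variable names invented):
# the library wraps the string in CDATA, so nothing is escaped by hand.
import xml.dom.minidom

doc = xml.dom.minidom.Document()
content = doc.createElement("content")
content.appendChild(doc.createCDATASection("<em>Ben & Jerry's</em>"))
print(content.toxml())  # <content><![CDATA[<em>Ben & Jerry's</em>]]></content>
```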
Discussion Summary
Escaped:
-
Pro: HTML should be escaped for symmetry with other formats.
-
Pro: Supports all forms of HTML.
-
Con: It's ugly ("a really horrible kludge").
Inline:
-
Pro: It'll force tool developers to make sure everyone writes well-formed XHTML.
-
Con: It'll force tool developers to make sure everyone writes well-formed XHTML.
-
Pro: DOM, XPath, and XSLT already have the content in an element node.
-
Info: SAX applications can swap in a DOM-building or XML-writing handler to select either form.
-
Con: Can be difficult to recover with many tools.
-
Define "recover" and its difficulty.
-
Recovering means getting escaped content into unescaped forms. Difficulty: many tools, particularly XML parsers, provide no mechanism for doing this, simply outputting escaped content as CDATA - perhaps as they should.
-
Con: Some widely used tools (like MSHTML) put out malformed XML.
-
Con: Old entries can't be made well-formed, but must be exported.
-
Con: XML-writers need special logic for XHTML (e.g. make <hr/> into <hr />)
-
Info: many XML-writing libraries support this option.
-
Pro: Can be validated using schemas.
Determination based on content type:
-
Con: Difficult for schema languages like RELAX NG or XSD to properly process and/or validate the data. Don Box
-
Actually RELAX NG can handle this just fine [DareObasanjo]
Further Discussion
[TimBray RefactorOk] Having read all this, it seems that it's not that complicated. One of the two following is true:
-
The content of <content> is data whose media-type is given by the type= attribute. There may be a requirement in some cases for base64 or other encoding, but RSS got by fine without that so we shouldn't try to invent it. The receiving software either is willing to try to do something with the given type or not.
-
The content is text (type="text/plain") and contains markup that is XML-escaped, and we'd like downstream software that is capable of this to unescape and use that markup.
Refusing to deal with the second case is attractive but stupid, as we can't ignore the legacy problem. Forcing all markup to be encoded to cater to the legacy is bad design. Thus it seems to me like the only plausible solution is one of the following:
-
have separate elements for <content> and <emcontent> (escaped-markup content)
-
have the same element and signal whether there's escaped markup with an attribute, em="true|false", and pick one of
-
the default is no escaped markup
-
the default is escaped markup.
Let's pick one of these and move on.
[KenMacLeod] Is this the same or different from <content type="text/html" encoding="none"> (which means that standard XML escaping is used) and <content type="text/xhtml" encoding="literal"> (which means that XHTML content is parsed)?
One part that confuses me is I seem to see a case where type="text/plain" but it's "still" text/html being sent and the reader somehow has to figure that out. That may be what em="true|false" means, but if you know that it's escaped markup, why not use type="text/html"?
Example, CDATA (1):
<content type="text/html"><![CDATA[ <p>Hello, <em>weblog</em> world! 2 < 4!</p> ]]></content>
Example, escaping (2):
<content type="text/html"> &lt;p&gt;Hello, &lt;em&gt;weblog&lt;/em&gt; world! 2 &lt; 4!&lt;/p&gt; </content>
Example, inline with default namespace (3):
<content type="application/xhtml+xml"> <body xmlns="http://www.w3.org/1999/xhtml"> <p>Hello, <em>weblog</em> world! 2 &lt; 4!</p> </body> </content>
Example, inline with namespace (4):
<content type="application/xhtml+xml" xmlns:x="http://www.w3.org/1999/xhtml"> <x:p>Hello, <x:em>weblog</x:em> world! 2 &lt; 4!</x:p> </content>
[ArveBersvendsen, RefactorOk] I've provided four different samples. My personal view is that (1) or (2) should be a MAY only where the content-type is text/plain or text/html. For application/xhtml+xml, they should be noted as SHOULD NOT, and either of (3) or (4) should be marked as MUST, with (1) or (2) as MUST NOT. [TimBray] Your (3) doesn't work, because there's no <content> element in the HTML namespace. You could put a <div> in or something. [ArveBersvendsen] I changed this to use <body>, which I believe is cleaner.
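For example (3), a namespace-aware parser sees real elements inside <content>. A small sketch of a reader pulling them out (the flattening to text at the end is just for demonstration):

```python
# A sketch of a reader handling inline XHTML as in example (3) above;
# namespace-aware parsing sees real elements inside <content>.
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
src = ('<content type="application/xhtml+xml">'
       '<body xmlns="http://www.w3.org/1999/xhtml">'
       '<p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>'
       '</body></content>')

content = ET.fromstring(src)
body = content.find("{%s}body" % XHTML)
# Per the discussion, a reader renders the children of <body>; here we
# just flatten them to text to show the parser sees structured markup.
print("".join(body.itertext()))  # Hello, weblog world! 2 < 4!
```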
[TimBray RefactorOk] Summing up a conversation between myself and AaronSw (I think) here: I find interpreting escaped content kludgy and horrible; among other things, readability is severely impaired. On the other hand, if you allow for escaped tag soup, then people can use Echo to archive their last five years of ill-formed postings. So it seems that at least the option of escaping content should be allowed. On the other hand, I predict that in five years' time, Echo will still be very popular, and at that time, the notion of generating non-well-formed markup will feel archaic and barbaric, and people will really wonder why we are making them add this ugly level of overhead. My tentative conclusion is that we should have an optional markup="escaped" attribute on the <content> element, for when you need to do this. But the default action should be to do the right thing, which is to generate well-formed content. Now we can restart the argument. Other voices?
[AaronSw] How is readability impaired with CDATA?
[TimBray] Specific answer: it's basically not OK to mandate the use of CDATA. For details, see section 4.3 of RFC 3470 (if you poke around, you can find an easier-to-read HTML version), also known as IETF Best Common Practice #70. By the way, anyone who's planning to do anything serious with XML should study that document; it contains a lot of highly concentrated wisdom.
-
[LeonardoHerrera RefactorOk] Agree with CDATA not being obligatory. My position is, if you don't need to escape anything (i.e., your "contents" are valid XML) then don't. If you have anything funky (*dodge*) that needs to be escaped, then enclose the whole node in a CDATA section. Easy to produce, easy to parse.
-
[MishaDynin RefactorOk] what about a required encoding={inline,escaped,base64} attribute? With a guideline that content providers should inline well-formed HTML whenever possible.
-
[AsbjornUlsberg RefactorOk] We shouldn't call the attribute 'markup="escaped"'. This kind of naming gives us tons of attributes and elements, which is no good to anyone. We need a strict and tight set of elements and attributes, with a wider variety of values. It's the number of values that should be rich, not the number of attributes and elements. We need to look at other standards and how they have done it. SOAP would be a good template in this context. The attributes and elements need to be as generic as possible, so that when extension is needed, it is done with new values and not new attributes or elements. This is extremely important, so we don't end up with a bloated standard within 6 months after its release. Changes and new requirements will come, and therefore we need to make sure that they can be implemented as easily as possible.
-
[ChrisWilper RefactorOk] Yes, there is absolutely a use case for syndicating non-well-formed content, but I disagree that we should assume it's text-based. But sometimes base64 is just too much. For that reason, a pass-by-reference (i.e. URL) should be supported in place of the content at the syndicating tool's option.
-
[GeorgBauer RefactorOk] I think that we need encoding designations in the content, because people will, even in 5 years, want to put non-XML content in there. For example text/plain. So I think we should allow content to carry a type and an encoding to tell consumers what to expect and how to parse it. This plays nice with other standards, too, like the MIME stuff for email. So for example I could set up an RSS feed for my inbox and just throw in messages with type="message/rfc822" and encoding="base64" or something like this, if I want to. Restricting types and encodings of content too early might be one of those areas where we get hit with big cluesticks by the users if we don't get it right.
-
[AsbjornUlsberg RefactorOk] Text/Plain won't be much of a problem in XML (it usually won't have to be encoded), but other than that, you're right.
-
[ChrisWilper RefactorOk] Hmmm... on the base64 stuff, maybe it's best just to refer to the resource via URL if it's non-base64. Or require that any character streams in base64 include the character encoding parameter part in the mime type.
-
[AsbjornUlsberg RefactorOk] I agree. It's stupid to put binary data in XML, or on the wire at all -- especially if it isn't optional, so all binary resources should therefore be referred via URLs.
-
[TomasJogin RefactorOk] Yes. I agree, this is a good idea.
-
[DannyAyers RefactorOk] Using an attribute seems a good compromise, yes. Now what about inclusion and interpretation of other forms of non-core stuff such as RSS (1.0 and 2.0), Creative Commons, ebXML, whatever. See ExtraInterop.
-
[ArveBersvendsen, RefactorOk] My view is that we should try to avoid entity-escaping whenever possible in favor of CDATA. CDATA may look ugly, but it's an unambiguous way of telling the parser that the data within might not be well-formed.
-
[ArveBersvendsen] Right now, the current EchoExample seems to dictate CDATA or character escaping, a move I find inherently bad: If an entry can be expressed as valid XML, it should be expressed as just that, XML. Preferably by declaring a default namespace for <content>.
-
[AsbjornUlsberg] I think this is very basic, and should be understood by anyone that has ever worked a bit with XML, and not to mention XSLT. Doing XSLT on escaped XML or XML inside CDATA is just extremely painful, and not to mention: stupid. Why would anyone want to do that?
-
[DeveloperDude] CDATA +1
-
[KenCoar, RefactorOk] sorry, i guess i'm swimming against the tide here. i have a strong dislike for allowing arbitrary elements in a defined xml structure. i'm for encoding, period. it also makes the work of the receiving engine a lot simpler.
-
[AsbjornUlsberg] Continue swimming. I can't see any reason or use-case where encoding everything is necessary or preferable. Encoding is a last resort; something you do when your content doesn't fit into the format you're putting it in, for whatever reason. This goes for formats and situations such as XML, databases, etc. Escaping is the final, not the only, way out.
-
[PeteProdoehl] Anything but escaping! CDATA works for me.
-
[JonathanPorter, RefactorOk] I'm not sure if any of the examples are really worthwhile, as they need to be specified much better. Examples 1 and 2 would be useful except that neither actually qualifies as HTML. I believe at minimum you have to have a "title" element. The XHTML examples exhibit the same problem and an additional one. As Tim stated, you have to be able to distinguish between inline XML and other kinds of data. The XHTML examples, like the HTML example, do NOT qualify for the minimum specification of 'application/xhtml+xml'. The more appropriate thing to do is make it a valid XHTML document and then encode it or put it into a CDATA section. Or make special cases for XML-based data and with namespaces. So two points to keep in mind: XML-based data (documents) vs. inline XML, and data (that matches the media type) vs. a fragment of that data.
-
[NormanWalsh, RefactorOk] On escaped HTML: this has all gotten way too complicated. My vote is for a very simple core that allows only well-formed XML. If someone wants to define an extension for dealing with non-well-formed HTML or binary data, more power to them. The core Echo namespace doesn't have to deal with them.
-
[Arien, RefactorOk] +1 I agree 100%
-
[NormanWalsh, RefactorOk] On CDATA: if your application cares about the difference between <![CDATA[<]]> and <, your application is (1) an editor, (2) doing digital signatures or something like that, or (3) broken. You can't mandate how I escape stuff.
-
[LeonardoHerrera RefactorOk] You are right about broken applications. The truth is, applications are inherently broken; and this is particularly true for older applications. If we want Echo to be adopted, then we must provide ways to easily migrate existing content.
-
[JeremyGray, RefactorOk] I'm generally with Tim in that I too see his three cases as distinct. I'm not sure, however, that it is a good idea to promote anything other than well-formed XML. Base64... CDATA'd or escaped unbalanced HTML... Escaped content of arbitrary format X... It's a pretty slippery slope with lots of room for fragmentation, best avoided if at all possible. If the list could be kept restricted well enough, I wouldn't oppose it, but I at least wanted to issue my concern re: the slippery slope. ChrisWilper's comment regarding referencing external content does have some attraction to it, and I might even consider extending that to any and all non-XML content. As for CDATA and/or escaping, I'm not really a big fan of either, but could live with them if necessary.
([JeremyGray] As a quick note regarding the above paragraph (I'm adding this a day or two later), I'm going to back off on the referenced content argument a bit as I am much more concerned with the escaping and canonicalization issues that will have to be sorted out for in-line content regardless of whether or not referenced content is supported at some point in time.)
On the subject of well-formed XML (and please point me to the correct Wiki page if I've missed it), there is one area I haven't yet seen people mention: canonicalization of XML for the purposes of extraction from the surrounding echo XML for display, persistence, etc., specifically with respect to things like external entity references, namespace declarations, etc. Any current thoughts on this issue? I have no hard and fast preferences at this time, but could easily see it jump up and bite us pretty quickly and figured it deserved mentioning. Generally speaking, I'd prefer to see content namespaces declared locally (for easy hacking of the content out from the rest of the XML) and external entity references avoided wherever possible, but if full-on canonicalization is required, so be it. (I'm just not sure that such a requirement would do great things for rapid, wide adoption.) EDIT: XML Canonicalization experts - 'canonicalization' seemed to be the best term that came to mind at the time, but if I've mistakenly abused the terminology, please let me know.
[DaveWarnock, RefactorOk] I suggest we allow XHTML only. Where your content is not well formed XHTML then you use a standard snippet of XHTML which includes a url for your content. If that url returns the correct content-type then normal standards will control what the client does with it. This should be simple for people with loads of older content while keeping the standard very clean for the long term.
-
[ArveBersvendsen, RefactorOk] -1. This would require extensive rewrites, both of templates, CMSes and aggregators.
-
[AsbjornUlsberg] True, but I understand and almost agree with Jeremy and Dave on this. Non-XML content can be referred to, but not via XHTML syntax. If non-XML content should be referred, it should be done in the standard way, via a <uri> or <href> element inside <content>. I see absolutely no reason for adding ad-hoc methods for doing external referring in this context, when we have (or can have) a standard syntax for doing it in other (e.g. all) contexts.
[BillHumphries, RefactorOk, OutrightDeletionNotOk] While I'd like to be stern and strict, I'd recommend that the default payload is assumed to be XHTML unless there's an encoding attribute. If the content is not XHTML, escaping, while regrettable, is preferable. When working with XML returned from a certain large search engine company, the descriptions of pages are in a node as escaped HTML, and one can write:
<xsl:value-of disable-output-escaping="yes" select="foonode" />
-
[DaveWarnock, RefactorOk] What happens when the original html contained escaped content? I guess it now becomes unescaped which is a bit of a pain. Is there a way to essentially double escape what was originally escaped? If not the use of the external link still works for me (whether via a XHTML snippet as I suggested or as a <uri> as Asbjorn suggests).
-
[AsbjornUlsberg] Escaping and unescaping can be a bit of a pain. But when you have a set of rules for how to do it, it's rather easy. You can then escape a document 100 times and unescape it 100 times, and it will look the same. The key is not to escape or unescape too much, and to do it in the right order. Let's take a fairly simple HTML snippet as an example:
<p>Echo rules! 4 &gt; 2. H&amp;M is a clothing store.</p>
Here, > and & are already escaped to &gt; and &amp;. Escaping this HTML snippet once, we get:
&lt;p&gt;Echo rules! 4 &amp;gt; 2. H&amp;amp;M is a clothing store.&lt;/p&gt;
When escaping, it's important to escape every & first, since & is the escape character. If you escape the brackets first, you'll end up escaping the & of the resulting &lt; as well, which of course isn't right. Now let's escape it just one more time:
&amp;lt;p&amp;gt;Echo rules! 4 &amp;amp;gt; 2. H&amp;amp;amp;M is a clothing store.&amp;lt;/p&amp;gt;
Now only the ampersands get escaped, since there are no literal <'s left in our snippet. To unescape this back, we do it in the opposite order, or else we will unescape too much. We unescape &lt; and &gt; first, and then the &'s:
&lt;p&gt;Echo rules! 4 &amp;gt; 2. H&amp;amp;M is a clothing store.&lt;/p&gt;
Then we do the same once again:
<p>Echo rules! 4 &gt; 2. H&amp;M is a clothing store.</p>
And we're back to where we started. If anyone isn't sure what escaping implies: it means replacing a set of characters with another set of characters. In many languages, \ introduces an escape sequence; in HTML, & does the same. The characters that need to be escaped in HTML and XML are &, < and >. Their escape sequences are, respectively, &amp;, &lt; and &gt;. Just mentioning. [MSM] Also ' (&apos;) and " (&quot;), as long as we're reviewing XML.
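The ordering rules above can be sketched in a few lines of Python (hypothetical helper names; any language with string replacement works the same way):

```python
# Escape/unescape ordering: & must be escaped first (it is the escape
# character itself) and unescaped last, or the result is mangled.
def escape_once(s: str) -> str:
    s = s.replace("&", "&amp;")   # ampersands first
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    return s

def unescape_once(s: str) -> str:
    s = s.replace("&lt;", "<")    # brackets first, in reverse
    s = s.replace("&gt;", ">")
    s = s.replace("&amp;", "&")   # ampersands last
    return s

snippet = "<p>Echo rules! 4 &gt; 2. H&amp;M is a clothing store.</p>"
twice = escape_once(escape_once(snippet))
# Two escapes followed by two unescapes round-trip exactly
assert unescape_once(unescape_once(twice)) == snippet
```

Swapping the replacement order in either function breaks the round-trip, which is exactly the pitfall described above.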
[SeanMcGrath, RefactorOk] XML's three special characters that need to be escaped can be worked around by adding elements called amp, lt and gt. I use this a lot in XML vocabularies because of the "2 to the n-1 ampersand" escaping problem (I call it 'ampersand attrition' in an article I wrote on the subject: http://www.itworld.com/nl/xml_prac/07042002/). Obviously it does not help in literal chunks of markup, but I suggest it is worth considering for WF payloads.
[MSM, RefactorOk] I see the possibility of syndicating a lot more than just textual content that can be delivered in XML form. I rather prefer what's shown in EchoExample (as of this writing), with maybe encoding added (base64 or ), and perhaps allow for specification of content length (so a feed consumer can skip anything it doesn't want to deal with on a size basis), and also an optional external reference that can be used when the EchoFeed consumer doesn't know what to do with the content fragment itself (or prefers to pass fetching/processing off to some other app or OS service). I guess where I differ from most of the above is that I'd always put the content as CDATA, as the content pieces themselves are meant, I always thought, to be atomic units. After the EchoFeed consumer unpacks the feed, it operates on the content units as it sees fit -- it may display them inline in the feed consumer's application display, or may show them as attachments, may discard them if they fail some security measure, etc.
<content type="text/html" xml:lang="en-us" encoding="UTF-8" length="48">
  <![CDATA[ <p>Hello, <em>weblog</em> world! 2 < 4!</p> ]]>
</content>
<content type="img/x-png" encoding="base64" length="31415" href="http://example.com/foo.png">
  <![CDATA[ ... imagine base64 data here ... ]]>
</content>
<content type="img/svg+xml" encoding="UTF-8" length="2112" href="http://example.com/bar.svg">
  <![CDATA[ ... imagine SVG document here ... ]]>
</content>
[LeonardoHerrera RefactorOk DeleteOk] Ugh, that "length" attribute scares me. I can envision a non-stop flow of badly implemented "length" attributes; thus, nobody will rely on that datum. Is it really necessary to include it? If not, I would prefer not to mention it at all, even if it is optional.
[JeremyGray RefactorOk] -1 for the length attribute being redundant given that in the example all of the options and their data have already been delivered, so why use anything but the largest and most pre-prepared version. An additional -1 for stepping even closer to the edge of the slippery slope called feature creep.
[AsbjornUlsberg, RefactorOk] I don't like CDATA'ing all content, nor do I like the length attribute. If we allow inline binary data, the length attribute is necessary, but I still dislike the idea of embedding images in XML. It's better to refer to them externally then, imho.
-
[KenMacLeod, RefactorOk] Why would a length attribute be necessary? base64 encoding is non-lossy (when you decode base64, the length is known) and one can't do "inline binary" in XML without encoding of some type.
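Ken's point can be checked directly: the decoder recovers the exact byte count, so a length attribute adds nothing (sketched here with Python's stdlib base64 module):

```python
import base64

payload = b"\x89PNG\r\n\x1a\n" + bytes(24)  # pretend these are image bytes
encoded = base64.b64encode(payload).decode("ascii")
# The decode is lossless: the consumer learns the true length for free
decoded = base64.b64decode(encoded)
assert decoded == payload and len(decoded) == 32
```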
-
[AsbjornUlsberg] Well, I thought it would be nice to know the size of the embedded binary object before you did something with it.
-
[MSM] That's kind of what I wanted "length" to indicate -- if the value was above the feed consumer's threshold, it could bit-bucket the data (and the feed provider would, if they wanted to provide an alternative pointer, include the href as well? maybe). However, since A) people will lie about the length of content and B) the length is there anyway, I can see doing without "length." No problem.
-
[ZhangYining] I might have missed something here, but if binary content is inlined, would the consuming side have the option not to take it?
If content metadata is allowed (see discussion of EchoExample), length should go there, instead of being an attribute of the content element.
-
[MSM] I'd imagine the consuming application has the option to ignore anything it likes in a feed. And I imagine they will, in great and varying form. I see now that "length" of any content would be advisory any place it was put in a feed. True content length would be measured by the consuming application.
-
[MSM] Inline binary data can be displayed by the feed consumer even when the consuming device (distinct from the feed retrieval device) is not attached to the network. Anyone who drops very large binary data into a feed where the audience is neither expecting nor prepared for such will probably not keep that audience for long.
[TimBray RefactorOk] I take what Brent says seriously, but forcing authors to escape everything has a pretty severe price in both readability and writeability. Brent is pretty convincing that the consumers would be happier with everything escaped, so it's a matter of whether we care more about making things easy for humans reading and writing by hand, or for code processing the feed. Not a slam-dunk either way.
[AaronSw, RefactorOk] Huh? CDATA sections make quoted content practically just as easy to read and write as literal. Can you seriously claim that:
<foo><![CDATA[bar]]></foo>
is so much harder to read and write than
<foo>bar</foo>
that we should make consumers go to horrible kludges?
[KenMacLeod, RefactorOk] It may be a development style. Far back into my SGML days playing with DocBook and mapping DocBook structure elements and mixed content into Perl objects, I would stash the mixed content as DOM-like (grove) objects. Today, I swap in a DOM-building SAX handler whenever I recognize I'm gonna have literal XML to preserve. People using DOM parsers, XPath, and XSLT already have the content as an element node. I believe it's these latter folks that benefit the most from literal or inline XML.
[JoeGregorio] As an aggregator builder I do see a use for both the inline and escaped content, mostly because Aggie uses a web browser as its output format. Stripping 'insecure' tags and attributes from HTML is easier, and more reliable, using XPath+DOM than it is using regexes. I am now using regexes but will soon switch to running Tidy on the content and then stripping its output via XPath+DOM. This is where the 'choice' for CMS vendors gets involved, and why I think we need a solution that allows both in-line and escaped. If tools like TypePad can produce well-formed XHTML all the time, then their content can be inlined, and when I read a feed from them I don't have to do the 'Tidy' step first, and that is a big savings in processing time. Obviously this is a concern because I am using a browser as the aggregator display device; if you're not displaying in a web browser, then your mileage may vary.
-
[KenMacLeod] Note TimBray's comments further above re. "it's basically not OK to mandate the use of CDATA." There's no significant difference between CDATA and escaping.
[AsbjornUlsberg] It depends on how you see it, and not to mention who sees it. For the human eye, CDATA'd HTML looks much better than escaped. Other than that, I agree with Tim.
-
[GarrettRooney] That's exactly it. There may be no technical difference between CDATA and escaping, but I can read CDATA'd HTML without going nuts, which seems like a perfectly good justification for its use to me.
[DareObasanjo] As an aggregator author all I can say is that Brent Simmons speaks for himself not for everyone who's building an aggregator. For instance, RSS Bandit allows users to create XSLT stylesheets that are used as themes over the content provided by the blog (screenshot1 and screenshot2) and I'd much prefer to consume well-formed XHTML and pass that to the XSLT engine as opposed to running the equivalent of HTML Tidy on the content every single time.
[HenriSivonen] Whether escaped (payload as string) is more convenient than inline (payload as subtree) largely depends on the interface between the syndication format processor and the content renderer. If I've understood correctly, the interface in NetNewsWire is that the RSS component hands the payload as a tag soup string to the Cocoa tag soup renderer. However, if one were to implement an aggregator over Mozilla (for example), the interface could conveniently accept a document tree. In such a case, the Echo component could pass a namespaced DOM subtree to the XHTML renderer.
Parsing tag soup is hard, so if the renderer interface wants a tree, parsing is non-trivial. On the other hand, serializing a tree to a string is easy. Therefore, in order to serve both kinds of renderer interfaces, it would make more sense to choose the payload-as-subtree model for the wire format instead of the payload-as-string model.
However, from the feed producing point of view, it is of course easier to spit out tag soup as string instead of producing proper XHTML document trees.
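Henri's asymmetry is easy to see with any off-the-shelf XML library (Python's ElementTree used here purely as an illustration): tree-to-string is one call, while tag-soup-to-tree needs a whole error-tolerant parser.

```python
import xml.etree.ElementTree as ET

# A consumer that received the payload as a parsed subtree can serialize
# it for a string-oriented renderer in one call...
subtree = ET.fromstring("<div><em>Ben &amp; Jerry's</em></div>")
as_string = ET.tostring(subtree, encoding="unicode")
assert as_string == "<div><em>Ben &amp; Jerry's</em></div>"
# ...but a consumer that received tag soup as a string has no comparable
# one-liner to get a tree; it needs a tolerant parser such as Tidy.
```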
[DonPark DeleteOk RefactorOk] Allow me to make a proposal which might not be as flexible as everybody wants, but is simple enough to support legacy issues as well as leaving the door open for the future.
-
we support the existing mass of RSS content through a 'legacy' type.
<content type="legacy"> same as RSS 2.0 <description> value </content>
-
we support XHTML content through an 'xhtml' type
<content type="xhtml"> unescaped XHTML document fragment </content>
Implication: aggregators that support 'xhtml' type must pre-define XHTML character entities
-
we support plain-text content through a 'text' type
<content type="text"> plain text content </content>
-
we support general XML content through an 'xml' type.
<content type="xml"> unescaped XML document fragment </content>
-
Other type values are reserved, except those that start with "x-".
All aggregators must support the 'legacy', 'xhtml', and 'text' types. The rest are optional.
Unknown or unsupported content types are to be ignored.
-
[HenriSivonen RefactorOK] That does not follow from using namespaced XHTML elements and attributes. Also, entity handling is not in the jurisdiction of Echo if Echo is an application of XML.
The XML spec predefines a few entities. All other entities must be declared in the DTD. You can't just make up a requirement to predefine some additional entities and still expect to be able to use existing ready-made XML processors (aka. XML parsers). And the whole point of making something an application of XML is that you can use ready-made off-the-shelf XML processors. Making Echo something that looks like XML but tampers with things specified in the XML spec in such a way that vanilla XML processors cannot be used is a really, really bad idea.
Declaring additional entities in an external DTD subset is a bad idea for applications like Echo, because DTD processing is expensive.
If you want to express non-ASCII, use UTF-8, UTF-16 or NCRs. All conforming XML processors must support those.
This issue recently popped up on www-html. See http://lists.w3.org/Archives/Public/www-html/2003May/0155.html http://lists.w3.org/Archives/Public/www-html/2003May/0217.html and http://lists.w3.org/Archives/Public/www-html/2003May/0221.html
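The NCR route Henri recommends needs no DTD at all: every conforming XML processor expands numeric character references out of the box (sketched with Python's ElementTree):

```python
import xml.etree.ElementTree as ET

# &#233; is the numeric character reference for e-acute; unlike the
# &eacute; entity, it requires no declaration and no DTD processing.
el = ET.fromstring("<p>caf&#233;</p>")
assert el.text == "caf\u00e9"
```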
[DonPark] Henri, existing ready-made XML processors can use the external DTD to process Echo feeds. It is just a matter of specifying the DTD in the feed. If this is not a problem for XHTML browsers, why is it a problem for Echo?
-
[HenriSivonen RefactorOK] Entities declared in the DTD are different from the predefined entities. Non-validating XML parsers are not required to process the external DTD subset. Does it make sense to burden Echo by requiring the use of XML processors capable of handling external entities and in practice requiring app writers to implement a DTD catalog? Non-ASCII is better communicated by using UTF-8, UTF-16 or NCRs.
And the XHTML character entities are problematic with XHTML UAs. Opera and Netscape 6.x use non-validating XML processors and, therefore, are free not to process the external DTD subset. They do not support entities declared in the external DTD subset. Mozilla appears to support the XHTML character entities in common cases, but it does so using a dirty trick: it maps a few common public ids to an abridged DTD that only contains the character entity declarations. So if you observe the character entities, Mozilla appears to parse the DTD, but if you observe something else, like attribute defaulting, Mozilla doesn't appear to parse the DTD.
The character entities are a can of worms. Let's not open it with Echo.
[ZhangYining RefactorOk] I am for:
-
signal escaping with an attribute flag, and choose one default;
-
use <![CDATA[ ... ]]>;
-
support base64 for content (or part of content) other than text.
[RichardTallent RefactorOk] What we have are two orthogonal issues:
-
Content format. We've got to get out of the HTML vs. XHTML rut here. Other compelling applications that support various other XML, non-XML, and encoded binary content will come. If FormerlyKnownAsEcho will be parsed with widely-available existing XML parsers, the final document, with its content, must be well-formed. This is best represented, from what I can tell, by a type attribute of the content element that would express the MIME type of the content.
-
Encoding method. There's more than just escaped and unescaped here. None, Xml (XHTML or others with namespaces), Escaped, Base64, YEnc, UUEncode, even ROT13 should all be valid choices. Let the tool decide what it will support, don't build it into the spec. Again, the best representation would be an encoding attribute of each content element.
New tool developers should not be stymied because of broken, unescaped HTML. Escaping requires only one or two commands on most modern platforms, but parsing FrontPage-esque HTML is a major undertaking. Put the burden on the publisher, not the consumer.
[AsbjornUlsberg] +1.000.000
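Richard's "one or two commands" claim holds on most platforms; as a sketch, Python's stdlib does it in one call each way:

```python
from xml.sax.saxutils import escape, unescape

html = '<p class="x">Ben & Jerry\'s</p>'
# Escaping for the publisher is a one-liner...
assert escape(html) == '&lt;p class="x"&gt;Ben &amp; Jerry\'s&lt;/p&gt;'
# ...and so is undoing it for a consumer that wants the raw string
assert unescape(escape(html)) == html
```

Parsing the broken, unescaped FrontPage-esque HTML on the consumer side has no comparably trivial counterpart, which is the asymmetry Richard is pointing at.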
[RolandWeigelt] Hmm, maybe a stupid question (and maybe I did miss something really obvious)... People seem to prefer CDATA encoding vs. entity encoding. What I don't see mentioned is that CDATA (which I generally like, BTW) has one big flaw: the text inside CDATA must not contain "]]>". Entity-encoded text can be encoded over and over again. But how do you encode, e.g., an XML example that contains a CDATA section using CDATA?
-
[LachlanCannon] I assume in that case you're expected to use mode="xml", since presumably if you're using xml it's well-formed.
[AsbjornUlsberg] I believe what Roland is asking about, is something like this:
<?xml version="1.0" encoding="utf-8"?>
<feed>
  <content>
    <![CDATA[
      <body>
        <em>This is not <strong>valid</em> XML, but valid HTML.</strong>
        <p><![CDATA[4 > 2]]></p>
      </body>
    ]]>
  </content>
</feed>
And this is of course not valid XML. What you must do in such a case is escape the innermost non-valid HTML/XML:
<?xml version="1.0" encoding="utf-8"?>
<feed>
  <content>
    <![CDATA[
      <body>
        <em>This is not <strong>valid</em> XML, but valid HTML.</strong>
        <p>4 &gt; 2</p>
      </body>
    ]]>
  </content>
</feed>
So CDATA can't be used in all cases, but when you only have to escape something once, it's as good as (and maybe better than) entity-escaping.
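For the record, Roland's "]]>" problem also has a standard workaround besides falling back to entity escaping: split the CDATA section just before the offending ">" and reopen it immediately. A sketch, with a hypothetical cdata_wrap helper:

```python
import xml.etree.ElementTree as ET

def cdata_wrap(text: str) -> str:
    # "]]>" cannot appear inside a CDATA section, so close the section
    # just before the ">" and immediately open a new one: "]]]]><![CDATA[>"
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

inner = "An XML example: <![CDATA[4 > 2]]> done."
doc = "<content>" + cdata_wrap(inner) + "</content>"
# The parser concatenates the adjacent CDATA sections into one text node
assert ET.fromstring(doc).text == inner
```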
[MichelValdrighi RefactorOk] Instead of coming up with cases of when to use CDATA, when to use nothing, or when to call 911, let's just pick ONE way to use the <content> element, one that doesn't force the author to write well-formed XML or to encode everything. CDATA everywhere, despite the ]]> issue, looks like the only solution that could fit all cases of content (well/bad-formed, escaped, binary (hell no, don't use binary in feeds)).
[AsbjornUlsberg, RefactorOk] So you find it very unlikely that content will ever need to be escaped more than once? What do you base that statement on? If we escape everything with CDATA, then the XML content will be of no use. The point of having well-formed, valid XML inside <content> is to be able to extract sub-content of content without any preprocessing or magic. You just run an XPath query, and you get what you want (e.g. all "href"s inside an XHTML document).
If we escape everything, the content is useless. If the content is useless, the format is as well. Imho, of course.
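Asbjorn's XPath scenario, sketched against a hypothetical minimal feed (ElementTree's limited XPath support is enough to make the point):

```python
import xml.etree.ElementTree as ET

feed = """<entry><content><div>
  <p>See <a href="http://example.com/a">one</a> and
  <a href="http://example.com/b">two</a>.</p>
</div></content></entry>"""

# Well-formed inline content: one query pulls out every href, with no
# unescaping or tag-soup preprocessing pass first.
hrefs = [a.get("href") for a in ET.fromstring(feed).iter("a")]
assert hrefs == ["http://example.com/a", "http://example.com/b"]
```

With escaped or CDATA'd content, the same task would require unescaping the payload and running it through a tolerant HTML parser before any query could be made.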
-
+1 [JeremyGray] Data should only be escaped when it absolutely must (i.e. it comes from a legacy archive which for some reason cannot be updated). I'm fine with standardizing on CDATA for the times that has to happen, but let's not go and cripple the XML usage of what should be XML content in this so-called XML standard. I'm still strongly voting for the simple, capable suggestion posted by TimBray very early in the life of this page.
[HenriSivonen] All the examples show only a subset of an (X)HTML document embedded in content. That is, html, body and head have been omitted. However, the Atom 0.2 snapshot doesn't mention that (X)HTML is special in the sense that mandatory parts of an (X)HTML document may be omitted when embedded in Atom.
Resolution
Considerable thought has gone into the discussion of this issue. Coincidentally, there is a third draft of the Necho RFC. What needs to happen (e.g., What needs to be added? What needs to be taken away?) to move forward to resolution of this topic?
[JeremyGray] Someone felt it appropriate to delete my comment, one not marked with either RefactorOk or DeleteOk. This wouldn't really bother me if the changes made to the poll in any way reflected my comments, which were (and still are):
-
[JeremyGray] Don't take this the wrong way, but can we perhaps have a poll with accurate terminology and a range of choices that reflects the discussion to date? I cannot really vote in the poll as it stands.
To clarify the above even further, my point regarding 'accurate terminology' had nothing to do with the words 'yay' or 'nay'. It had to do with the misuse of the words 'quoted' and 'inline', neither of which is accurate terminology for the concepts being discussed on this wiki page. Further, the two presented choices, even once considered using accurate terminology, don't reflect the options that have actually been discussed here. At this point I'd like to see the poll rebuilt into a new poll or set of polls by an individual who was actively involved in the discussion on this page.
[JonathanSmith] Sorry, I was the one who deleted your comment and also made the cut & paste mistake with the poll. Later I realized my mistake and hoped that someone would refactor it. Instead, I came back to your criticism, so I made the changes I thought were appropriate... In the spirit of wiki I would invite you to do likewise.
-
[JeremyGray] No worries. As for doing likewise, I wanted to give the more involved parties a chance to format the poll(s), but since the page has sat for a while without one of them jumping in to do it, perhaps I'll give it a go in the near future. Can't promise being able to get to it today, though, as I have mega-work to catch up on (another of my reasons for not jumping in already, other than to comment).
See also content