In my experience, people don't read specs carefully; instead
they view source and emulate. And when they emulate content
that is escaped without a clear signal, they emulate poorly.
I'd like to get to the point where the original functionality
of the RSS 0.90 link tag can be achieved with the XPath
expression "//a/@href" on those feeds that have well-formed
HTML.
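As a rough illustration of the "//a/@href" idea, here is a minimal Python sketch. The sample markup and the function name `extract_links` are made up for this example, and the standard library's ElementTree only supports a subset of XPath (it cannot select attributes directly), so the sketch selects the `a` elements and reads `href` from each.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed (X)HTML fragment, as it might appear unescaped in a feed.
CONTENT = (
    '<div>'
    '<p>See <a href="http://example.com/one">one</a> and '
    '<a href="http://example.com/two">two</a>.</p>'
    '</div>'
)

def extract_links(markup):
    """Return the href of every <a> element, in document order."""
    root = ET.fromstring(markup)  # raises ET.ParseError if not well formed
    # ElementTree's XPath subset has no attribute axis, so walk the
    # <a> elements and read the attribute from each one.
    return [a.get("href") for a in root.iter("a") if a.get("href") is not None]

print(extract_links(CONTENT))
```

This only works because the fragment is well formed; escaped or tag-soup content would have to be unescaped or repaired first.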
If you are a user of a recent version of IE or Mozilla, you
already have a validator for well-formedness.
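Outside a browser, any XML parser can play the same role. A minimal sketch in Python, assuming the standard library's ElementTree parser is acceptable for the check (the helper name `is_well_formed` is hypothetical):

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the string parses as a single well-formed XML element."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<p>ok <b>bold</b></p>"))   # properly nested
print(is_well_formed("<p>bad <b>bold</p>"))      # <b> is never closed
```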
Making the signal an element instead of an attribute makes life
easier for both tag-soup, regex-based approaches and
validated, schema-based approaches.
Ultimately, I would like to be able to move on to discussing such things as how relative URLs are to work, and I fundamentally believe that programmatic adjustment of content which is not well formed is an unsafe proposition.
Sam, re attributes vs. elements, I think this is a clear case for using attributes.
The serialization-related interpretation of content of an XML element should not depend on the content itself. (See XML's encoding attribute which is a major pain, IMHO.)
Your proposal is a case in point: it is non-orthogonal in the sense that both the second and third forms are also valid instances of the first form. If we want to add other "encoding" methods later, we might find that we can't.
About an encoding attribute, we can also swap the senses of (what I called) "none" and "literal" so that "XML" is the default ("none") and "escaped" is the encoding of legacy content.
Can somebody write up the appropriate xml schema or relaxNG or dtd for varying the content of what is inside based on an attribute? And do the same based on an element?
Can somebody write up a regular expression looking for an xml element of a given name vs looking for an attribute?
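For anyone who wants to try the regex exercise, here is one possible starting point in Python. The element name `escaped` and the attribute form `mode="escaped"` are taken from proposals mentioned in this thread; the patterns themselves are only a sketch and are not a robust way to parse XML.

```python
import re

# Signal carried by an element name: one fixed token to look for.
ELEMENT_RE = re.compile(r"<escaped\b")

# Signal carried by an attribute: the pattern has to allow either quote
# style, optional whitespace around "=", and other attributes before it.
ATTRIBUTE_RE = re.compile(
    r"""<content\b[^>]*?\bmode\s*=\s*(["'])escaped\1"""
)

def uses_escaped_element(feed_text):
    """True if the text contains an <escaped> start tag."""
    return ELEMENT_RE.search(feed_text) is not None

def uses_escaped_attribute(feed_text):
    """True if the text contains <content ... mode="escaped">."""
    return ATTRIBUTE_RE.search(feed_text) is not None

print(uses_escaped_element("<content><escaped>&lt;b&gt;</escaped></content>"))
print(uses_escaped_attribute("<content type='text/html' mode = 'escaped'>x</content>"))
```

Neither pattern copes with comments, CDATA sections, or a ">" inside an attribute value, so treat this as a way to compare the two shapes rather than as working parser code.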
The essential difference between "my" and "Ken's" proposals boils down to this. By looking at the tangible implications of the decision, I believe we can come to consensus quickly.
FYI: the reason why I ask the questions above is that I believe I know the answers, and they support the argument for an element. But please, do the exercise for yourselves and see if you come to the same conclusion.
Sjoerd: it isn't clear to me why one would need an <xml> element inside xml to say that the child of a given node is xml, but if that helps us come to consensus, I'm game.
Why do we need so many ways of expressing content? It's either textual (in which case what is wrong with CDATA alone?), or non-textual (in which case base64 is acceptable).
Since we can boil it down to two types, it really doesn't need to be that flexible; you can just have <inline> and <encoded> element types.
The rationale that "people don't read specs" is a flimsy one, imho. The aggregators should throw out malformed content instead of trying to process it. Tag soup regexps are something we should be avoiding, not finding workarounds for.
AFAIK you cannot say anything about the <escaped> element then.
I still haven't seen a response on the need for base64 encoding. Even a pictureblog doesn't need it. (The image is not going to get from the camera to your entry post form in base64 format.) It's much easier to use other, better tools to put your pictures online, and just post the link.
Sam,
Is there really a demand to include base64 encoded stuff in content? I'd keep stuff like that out of V1 of Echo/Pie/whatever and keep it as an idea for v-next.
Zip, Sjoerd: I've added an <xml> element to my proposal on the wiki.
Sjoerd, Dare: it is my belief that this will be important when we come to the API and archiving portions of the roadmap. If we don't, then this clearly won't survive the v1 cut. For now, I'm simply content if we accept the premise that the way in which the bytes are going to be expressed isn't necessarily going to remain a binary decision for now and forever.
The more important question to me is attribute vs element. I claim that it is easier to parse for an element using regular expressions than to scan for an attribute (or a portion thereof, in Dare's proposal). I also claim that it is easier to create a DTD or schema in which the valid children depend on the name of the parent element instead of some heuristics based on one or more attributes.
Based on the example above, signal elements are only present when the content is escaped. Is this true? If so, then an application cannot determine if an unknown element is an unknown signal element or an unknown content element. This is a problem because unknown signal elements and unknown content elements will be handled differently by many applications.
Sjoerd & Sam,
It is an unfortunate aspect of working with XML that people decide to limit their XML vocabularies due to the shortsightedness of the W3C XML Schema working group. Quite frankly, I believe attributes work better for describing an element's metadata as opposed to being shoved into its content, and I also believe one can write a RELAX NG schema that can describe these constraints. Similarly, a W3C XML Schema annotated with Schematron assertions could also describe these constraints.
I'm going to download Jing and see if I can write a RELAX NG schema for Tim Bray's proposal. If so I'll post it in a few.
PS: I prefer Tim Bray's proposal to mine. I'll probably withdraw mine and + 1 his instead.
Dare, it is indeed unfortunate. However W3C XML Schema is far more widely supported, so Echo should support it too. The only formats that can reasonably use RelaxNG are meta standards like XSLT and RDF, that have no chance of getting a useful W3C XML Schema. Echo does not fall in that category.
Sjoerd,
There you go. Schemas for Tim Bray's proposal in RELAX NG and XSD. The RELAX NG schema is stricter than the XSD schema due to limitations of XSD. For instance
<content type="escaped">
I am notescaped
</content>
is not caught by the XSD schema but should be by the RELAX NG schema.
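Where a schema language cannot express that rule, it can still be checked in application code. The following Python sketch is not either of the schemas mentioned above; it is a hypothetical check (the function name and the sample feed are assumptions) that flags any `content` element with `type="escaped"` that nevertheless contains child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the first entry claims to be escaped but contains
# real markup; the second is escaped properly.
FEED = (
    '<feed>'
    '<entry><content type="escaped">I am <b>not</b> escaped</content></entry>'
    '<entry><content type="escaped">I am &lt;b&gt;properly&lt;/b&gt; escaped</content></entry>'
    '</feed>'
)

def escaped_content_violations(feed_xml):
    """Return <content type="escaped"> elements that contain child elements."""
    root = ET.fromstring(feed_xml)
    return [
        c for c in root.iter("content")
        if c.get("type") == "escaped" and len(c) > 0  # len() counts child elements
    ]

for bad in escaped_content_violations(FEED):
    print("unescaped markup found:", [child.tag for child in bad])
```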
PS: I'm curious, on what platform are there implementations of XSD validators and none for RELAX NG?
Now that Sam has agreed on the <xml> element, the only reason to choose attributes over elements is taste. Attributes look better. I don't think that warrants dropping the most used XML schema language.
XPath 2.0/XSLT 2.0/XQuery are built on XML Schema. Every new XML technology from Microsoft is built on XML Schema. I work at Q42, where we're building Xopus, an XML Editor. All our customers use XML Schema; we'd like to build support for RelaxNG, but we've had no requests to do so yet. RelaxNG might be the new cool thing, but the corporate world doesn't use it yet.
Sjoerd,
I prefer elements to attributes, but you are right that with attributes W3C XML Schema cannot describe the content model strictly, so we are better off using a content model that can be described strictly with both languages.
I am for attributes, and I think trying to put (X)HTML validation into the Echo Schema is a bad idea to begin with. It will make the Echo format that much more fragile. Are you going to add in validation for 'h' and 'section' elements, which are part of XHTML 2.0? How about SVG and MathML elements, depending on the profile of XHTML chosen? Any schema should be for the Echo parts of the format only.
If someone puts a valid element (for their profile of (X)HTML) into an Echo feed that causes their feed to suddenly be invalid, what are they going to do? I would guess they'd go back to escaping their HTML, the opposite direction we want to be going. I would avoid having the Echo schema try to validate anything but the 'Echo' parts of the format.
Of course, this does raise the question of how to indicate which version of (X)HTML you are stuffing into that 'content' element, so you can get the right Schema or DTD to validate it against.
Joe,
I don't think anyone is asking for XHTML validation in the Echo spec. What gave you that impression? As for worrying about which versions of XHTML are used, I'd suggest that the Echo (we really need a new name) spec should just mandate XHTML 1.0 Transitional for V1 and revisit the issue in V2.
Who said anything about (X)HTML validation? The point is that both the schema and example feeds clearly indicate what is going on and what the options are. The schema I wrote was a bit buggy, it should have been:
That is: a sequence of any element, mixed with text. processContents="lax" means that the elements don't need to be valid, unless there is a declaration available.
So if somebody wants to create a validator that validates Echo feeds that may only contain valid XHTML, then he/she can create a new schema, and import the XHTML schema and the Echo schema, and that's it.
I'm not much of a big-S schema person, but if it's about explicit interoperability with XML Schema, besides allowing anyXML and doing base64/xs:string only at the application level, I'm probably a +1 on it.
I'm not sure regexing is a driving factor, but it would be much less so than XML Schema interoperability. I tried looking for regex-based RSS parser source for context, but couldn't find one. I think regex-parsers are still going to have a problem with
Sam and Tim: give each other a phone call and make up your minds. It may be hairsplitting, but I don't like "close enough" arguments. Do we choose one extra level of validation or a nice syntax?
base64 support (or other way to include binary data) is important.
There must be a way to upload images and other media using the API, and ideally a way to archive the whole blog including images. Base64 encoding is the most straightforward solution.
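To make the base64 option concrete, here is a small Python sketch of embedding binary data in a `content` element and getting it back. The attribute name `mode`, its value `base64`, and both helper names are assumptions for illustration; nothing here is taken from an agreed Echo spec.

```python
import base64
import xml.etree.ElementTree as ET

def embed_binary(data, mode_attr="mode"):
    """Wrap raw bytes in a <content> element as base64 text (hypothetical layout)."""
    content = ET.Element("content", {mode_attr: "base64"})
    content.text = base64.b64encode(data).decode("ascii")
    return content

def extract_binary(content):
    """Recover the original bytes from an element built by embed_binary."""
    return base64.b64decode(content.text)

payload = bytes(range(256))          # stand-in for image data
elem = embed_binary(payload)
print(ET.tostring(elem, encoding="unicode")[:60] + "...")
print(extract_binary(elem) == payload)
```

Base64 turns every 3 bytes into 4 characters, so the encoded form is roughly a third larger than the original, which is worth keeping in mind for anything sent down the wire.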
Re base64 - what's wrong with a stock href, instead of clogging a feed item with a boatload of image data? The last thing I want coming down the wire with a feed is a few hundred K of encoded image content.
re content encoding - the XML parser I use handles this just fine without me paying any attention to it at all - escaped, not escaped, CDATA - it all gets handed to me just fine by the VisualWorks Smalltalk XML parser. Gads, I wouldn't want to have to use regex or deal with figuring out the encoding in application level code, and no one else should want to either.
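The same behaviour can be seen with other XML parsers. A minimal Python sketch using the standard library's ElementTree (the sample strings and the helper `text_of` are made up for the example):

```python
import xml.etree.ElementTree as ET

ESCAPED = "<content>&lt;b&gt;hi&lt;/b&gt;</content>"   # entity-escaped markup
CDATA = "<content><![CDATA[<b>hi</b>]]></content>"     # same markup in CDATA
INLINE = "<content><b>hi</b></content>"                # real child element

def text_of(xml):
    """Return the leading text of the root element as the parser reports it."""
    return ET.fromstring(xml).text

print(repr(text_of(ESCAPED)))   # the parser has already undone the escaping
print(repr(text_of(CDATA)))     # identical string to the escaped case
print(repr(text_of(INLINE)))    # no leading text; the markup is a child element
```

The escaped and CDATA forms arrive as the same string, while inline markup arrives as elements. What the parser cannot report is whether that string was meant to be read as markup or as literal text.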
James,
Go back to Misha's original context: "There must be a way to upload images and other media using the API, and ideally a way to archive the whole blog including images."
Echo will not only be a base for aggregation but also for an archive format and for an editing API; as such, the ability to include images and other media is required for the format when it is used in those contexts.
There has been a great deal of forward motion in the Echo project today. Looks like the discussion about escaping HTML has come to a conclusion. Other areas that have settled seem to be Author and PermLinks. Things are looking very good for Echo,...
Re base64, the API, and the format - I really, really don't want to see base64 in feeds - at the same time, I see a lot of value having support for that in the API - I post to my blog that way - I use a URL-encoded form, but the form elements are encrypted (and the encrypted elements placed in base64 for transmission). I just think it might well be overdoing consistency to worry about this in the feed format. IMHO, the feed requirements are, in fact, different than the posting API requirements. I'm completely not sold on worrying about an archive format - so long as a weblog responds properly to API requests, what difference does it make how entries are stored?
James,
"Archive format" has nothing to do with specifying how entries are stored on the server. It has to do with the ability to export and import entries from one weblogging system to another.
It appears that the latest consensus is to have one of (content), (content mode="escaped"), and (content mode="base64"). I can certainly live with that.
A couple of notes, though:
1. The name "mode" is overloaded in our business. Eventually it will clash with something else we'd like to call "mode". I suggest we rename it "escaping" instead.
2. The default option (if you don't specify @mode|@escaping) should also have its own tag name, say "xml". So (content) and (content escaping="xml") would be equivalent. This makes it easier for tool builders to refer to the "default" mode of operation in their code consistently (oops, did I say "mode"?)
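A consumer of the three forms might dispatch on the attribute roughly as follows. This is a Python sketch under stated assumptions: the attribute name is passed in (so either "mode" or "escaping" works), a missing attribute is treated as "xml" as suggested in note 2, and the function name `read_content` and its return shape are invented for the example.

```python
import base64
import xml.etree.ElementTree as ET

def read_content(elem, attr="mode"):
    """Return (kind, payload) for a <content> element; layout is hypothetical."""
    kind = elem.get(attr, "xml")   # absent attribute means the default form
    if kind == "xml":
        # Inline markup: re-serialize the leading text plus each child
        # (ET.tostring includes the text that follows each child).
        inner = (elem.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in elem
        )
        return kind, inner
    if kind == "escaped":
        # The parser has already unescaped the entities; the text is markup source.
        return kind, elem.text or ""
    if kind == "base64":
        return kind, base64.b64decode(elem.text or "")
    raise ValueError("unknown content form: %r" % kind)

print(read_content(ET.fromstring("<content>a <b>c</b> d</content>")))
print(read_content(ET.fromstring('<content escaping="base64">aGk=</content>'),
                   attr="escaping"))
```

Giving the default its own name means the first branch handles both (content) and (content escaping="xml") without a special case.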
Mark has two RSS feeds (both 2.0 flavor) one for his blog and another for comments. removed the funk from his RSS. removed the link element. removed the dc:date element. His dates were off by an hour. removed the rich content. removed the comment...