PaceShouldBeWellFormed


Abstract

[MarkPilgrim] Redefine the rules for determining the character encoding of an Atom feed served over HTTP.

Status

Withdrawn

Rationale

RFC 3023 defines rules for determining the character encoding of a feed (or any other XML document served over HTTP). The default configuration for most web servers is to serve ".xml" files as "text/xml" with no charset parameter. According to RFC 3023, all of these feeds MUST be parsed as "us-ascii". This is nonintuitive and unacceptable for UnprivilegedUsers, who are left without a way to publish Atom feeds in any other encoding.

Proposal

Insert as section 6:

6. Client processing requirements

Atom feeds in local files MUST be well-formed XML 1.0, as defined in Section 2.1 of the XML specification <http://www.w3.org/TR/REC-xml/#sec-well-formed>.

Atom feeds served over HTTP SHOULD be well-formed XML 1.0. Well-formedness of XML documents relies on first determining the character encoding, and RFC 3023 defines how to determine the character encoding of XML documents served over HTTP. However, there are a large number of publishers who wish to serve Atom feeds over HTTP, but who do not have sufficient control over their HTTP server to control the media type. Therefore, the following rules supersede the rules defined in RFC 3023:

6.1 Determining the character encoding of an Atom feed

  1. Publishers MAY include the charset parameter along with the media type. If the charset parameter is present, clients MUST parse the Atom feed in that charset, ignoring any charset declared in the encoding attribute of the XML declaration.

  2. Publishers SHOULD serve all Atom feeds with the media type "application/atom+xml" (registered in Section 8 of this document). Clients MUST treat "application/atom+xml" as "application/xml" and determine the character encoding as per RFC 3023.

  3. If a publisher wishes to serve an Atom feed over HTTP, but for some reason they are unable to use the "application/atom+xml" media type, the publisher SHOULD use "application/xml", and clients MUST determine the character encoding as per RFC 3023.

  4. If a publisher is unable to use "application/atom+xml" or "application/xml", they MAY use "text/xml". According to RFC 3023, XML documents served as "text/xml" with no charset parameter have a character encoding of "us-ascii".

    1. The publisher SHOULD escape non-US-ASCII characters as character references, e.g. '&#xf8;' for the character 'ø'.

    2. Clients MUST begin to parse such documents with a "us-ascii" encoding. However, if the root-level element is in the Atom namespace (defined in Section 1.3 of this document), then clients MUST reparse the document as if its media type had been declared as "application/xml", and determine the character encoding according to the rules in RFC 3023 for "application/xml".

  5. If a publisher is unable to use any registered XML media type, they MAY serve Atom feeds with any registered media type. Clients MUST begin to parse such documents as if they were XML documents, with a "us-ascii" encoding. However, if the root-level element is in the Atom namespace (defined in Section 3.1 of this document), then clients MUST reparse the document as if its media type had been declared as "application/xml", and determine the character encoding according to the rules in RFC 3023 for "application/xml".
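The rules in 6.1 can be read as a small decision procedure. The following is a minimal sketch of that reading, not a normative implementation. The function names (determine_encoding, sniff_xml_encoding, root_namespace) are hypothetical, the Atom namespace URI is an assumption taken from RFC 4287 (which postdates this proposal), and the "as per RFC 3023 for application/xml" step is simplified to a check for a byte order mark and the XML declaration, falling back to UTF-8.

```python
import codecs
import re
import xml.parsers.expat as expat
from email.message import Message
from typing import Optional

# Assumption: the namespace from RFC 4287; this proposal does not state the URI.
ATOM_NS = "http://www.w3.org/2005/Atom"

_XML_DECL = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(body: bytes) -> str:
    """Simplified stand-in for the application/xml rules: BOM, then declaration."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _XML_DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


class _StopParsing(Exception):
    """Raised from the handler to abandon the parse after the root element."""


def root_namespace(body: bytes) -> Optional[str]:
    """Begin parsing as us-ascii and return the root element's namespace URI."""
    names = []

    def on_start(name, attrs):
        names.append(name)
        raise _StopParsing

    parser = expat.ParserCreate(encoding="us-ascii", namespace_separator=" ")
    parser.StartElementHandler = on_start
    try:
        parser.Parse(body, True)
    except _StopParsing:
        pass
    except expat.ExpatError:
        return None
    if not names:
        return None
    # With a namespace separator, expat reports "uri localname".
    uri, _, _ = names[0].rpartition(" ")
    return uri or None


def determine_encoding(content_type: Optional[str], body: bytes,
                       atom_ns: str = ATOM_NS) -> str:
    header = Message()
    header["Content-Type"] = content_type or "application/octet-stream"
    media_type = header.get_content_type()
    charset = header.get_param("charset")

    # Rule 1: an explicit charset parameter wins over the XML declaration.
    if isinstance(charset, str) and charset:
        return charset.lower()
    # Rules 2 and 3: application/atom+xml is treated as application/xml.
    if media_type in ("application/atom+xml", "application/xml"):
        return sniff_xml_encoding(body)
    # Rules 4 and 5: start as us-ascii; reparse as application/xml
    # only if the root element is in the Atom namespace.
    if root_namespace(body) == atom_ns:
        return sniff_xml_encoding(body)
    return "us-ascii"
```

Under these assumptions, a feed served as "text/xml" with no charset parameter and an Atom root element takes its encoding from the XML declaration, while a non-Atom document served the same way stays "us-ascii", as RFC 3023 requires.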

6.2 Handling well-formedness errors

After determining the character encoding by the rules in section 6.1 of this document, clients MUST use a conforming XML parser to parse an Atom feed. In particular, clients MUST stop processing at the first well-formedness error, although they MAY display any information they have parsed before the first well-formedness error.
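As an illustration of the stop-at-first-error requirement, here is a minimal sketch using Python's expat binding. The function name parse_until_error is hypothetical, and the encoding argument is assumed to be the result of the rules in section 6.1. The parser halts at the first well-formedness error; the element names collected before that point stand in for "any information they have parsed", which a client MAY still display.

```python
import xml.parsers.expat as expat
from typing import List, Optional, Tuple


def parse_until_error(
    body: bytes, encoding: str
) -> Tuple[List[str], Optional[expat.ExpatError]]:
    """Parse with a conforming parser and stop at the first well-formedness error.

    Returns the element names seen before the error, plus the error itself
    (or None if the document was well-formed).
    """
    seen: List[str] = []
    # The encoding determined beforehand overrides the document's own declaration.
    parser = expat.ParserCreate(encoding=encoding)
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    try:
        parser.Parse(body, True)
    except expat.ExpatError as exc:
        # No recovery is attempted: processing ends here.
        return seen, exc
    return seen, None
```

The sketch deliberately makes no attempt to repair or skip past the error; it only reports what was seen up to that point.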

Here is a non-comprehensive list of things clients have been known to do after encountering a well-formedness error, which this document specifically prohibits:

Impacts

This proposal has significant impact for clients, who must implement custom handling of Atom feeds to properly determine the character encoding. However, research has shown that virtually no clients currently determine character encoding properly anyway, so the impact may be reduced.

Notes

[MarkPilgrim] After careful consideration and discussion with the author of RFC 3023 and others, I am withdrawing this proposal. It is inappropriate for Atom to redefine the behavior of a media type that isn't ours. Apache has fixed their problem (the latest versions of 1.3 and 2.0 now default to "application/xml" for .xml files), so this mess will sort itself out in five to ten years. If you can't live with XML's rules in the meantime, then you shouldn't have chosen XML.


CategoryProposals