Companion to Atom

Work in progress.  By Sam Ruby

Unicode for Syndication consumers

XML is carefully designed to enable character encodings to be reliably detected and handled correctly.  Unfortunately, not all content producers are as careful as they ought to be.  Templates are used with content from third party sources which hasn't been properly tagged with its encoding, and the result is XML which is not well formed.  Conformant XML parsers are required to reject such documents outright, placing the problem of remediation squarely in the application's lap.

In all cases, the right first course of action is to use a proper XML parser.  This will enable documents which are encoded properly to be processed correctly.

In the cases where this fails, the application may choose to report a failure, or it may choose to recover.  Prior to any attempted recovery, there needs to be a determination of both the declared encoding and the actual encoding.
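As a sketch of the overall flow, in Python, using the standard library's ElementTree parser (the helper names parse_feed, sniff_declared_encoding, guess_actual_encoding, and recover are hypothetical; the helpers are sketched in the sections that follow):

    import xml.etree.ElementTree as ET

    def parse_feed(raw):
        # Right first course of action: a conformant XML parser.
        try:
            return ET.fromstring(raw)
        except ET.ParseError:
            # Parsing failed; decide whether to report or recover.
            declared = sniff_declared_encoding(raw)  # sketched below
            actual = guess_actual_encoding(raw)      # sketched below
            if declared is None or actual is None:
                raise                                # report the failure
            return ET.fromstring(recover(raw, declared, actual))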

Determining the declared (or implied) encoding

If the first byte of the feed is XML whitespace (hex 09, 0A, 0D, or 20); or if the first byte is an ASCII less-than sign (hex 3C) followed by an alphabetic character (hex 41 through 5A, or hex 61 through 7A), then the explicit encoding declaration is missing, and the default encoding of utf-8 is to be used.

If the first four bytes of the feed are hex 3C 3F 78 6D (ASCII "<?xm"), then there may be an explicit encoding declaration.  Such a declaration can be recognized as the ASCII string encoding="name" or encoding='name' occurring prior to the closing question mark (hex 3F).  If no such name is present, the default encoding is again utf-8.
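Expressed as code, and still assuming Python, those two rules might look like the following minimal sketch (the function name is hypothetical; it returns None for the cases deemed not correctable):

    import re

    def sniff_declared_encoding(data):
        # Rule 1: leading XML whitespace, or '<' followed by a letter,
        # means there is no declaration; the default of utf-8 applies.
        if data[:1] in (b'\x09', b'\x0a', b'\x0d', b'\x20'):
            return 'utf-8'
        if data[:1] == b'<' and (b'A' <= data[1:2] <= b'Z' or
                                 b'a' <= data[1:2] <= b'z'):
            return 'utf-8'
        # Rule 2: '<?xm' opens an XML declaration, which may name an
        # explicit encoding before the closing question mark.
        if data[:4] == b'\x3c\x3f\x78\x6d':
            end = data.find(b'?', 4)
            declaration = data[4:end if end != -1 else len(data)]
            match = re.search(rb"""encoding=["']([^"']+)["']""", declaration)
            return match.group(1).decode('ascii') if match else 'utf-8'
        return None  # anything else is probably not correctable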

These two rules cover the vast majority of cases where the encoding is incorrect, but correctable.  In all other cases, it probably is wise to assume that the problem is not correctable.

Determining the actual encoding

If all bytes are hex 09, 0A, 0D, or in the range hex 20 through 7E, then the encoding is most likely us-ascii.

If all bytes in the range hex C0 through FF are followed by bytes in the range hex 80 through BF, AND all bytes in the range hex 80 through BF are preceded by bytes in the range hex 80 through FF, then the encoding is most likely utf-8.
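In Python, both checks can be sketched as follows; rather than hand-rolling the byte pairing, a strict utf-8 decode enforces the lead-byte/continuation-byte rule described above (and a little more besides):

    def looks_like_us_ascii(data):
        # Tab, LF, CR, or printable ASCII only.
        return all(b in (0x09, 0x0a, 0x0d) or 0x20 <= b <= 0x7e
                   for b in data)

    def looks_like_utf8(data):
        # A strict decode rejects any byte sequence that violates
        # the pairing of lead bytes and continuation bytes.
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False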

If all bytes are in a range acceptable to us-ascii OR in the range hex A0 through FF, then the encoding is most likely in the iso-8859-n family of encodings.  Here you do have to do a little guessing: iso-8859-1 covers most of Western Europe, iso-8859-5 covers Cyrillic, iso-8859-7 covers Greek, and so on.  In the case of a missing or invalid encoding declaration, my experience is that iso-8859-1 is most often the correct interpretation.

Finally, suppose all bytes are in a range acceptable to the iso-8859 family of encodings or in the range hex 80 through 9F, AND you have both an indication that the encoding is invalid (as evidenced by a parser error) and an indication that the document is generically in an ASCII-based encoding (as evidenced by a match on one of the two rules in the section on declared encodings above).  In that case, you may be seeing what is probably the most common encoding error found in syndication feeds: stray windows-1252 characters in an otherwise valid document.
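Putting these observations together in decreasing order of confidence, and building on the helpers sketched above (the choice among the iso-8859-n family is simplified here to the iso-8859-1 default noted earlier):

    def guess_actual_encoding(data):
        if looks_like_us_ascii(data):
            return 'us-ascii'
        if looks_like_utf8(data):
            return 'utf-8'
        # No bytes in hex 80-9F: plausibly one of iso-8859-n.
        # Which member is a guess; iso-8859-1 is most often right.
        if not any(0x80 <= b <= 0x9f for b in data):
            return 'iso-8859-1'
        # Bytes in hex 80-9F in an otherwise ASCII-based document
        # are most commonly stray windows-1252 characters.
        return 'windows-1252'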

Recovery

Once you have determined the declared and (probable) actual encodings, you have a number of options for recovery.  You can attempt to insert or correct the EncodingDecl to match the actual encoding, or you can convert the document to match the declaration.  If you opt to go the latter route, a general-purpose conversion library (such as iconv or ICU), or the codecs built into your language's runtime, may help.
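As a sketch of the conversion route, continuing the hypothetical helpers above: re-decode the bytes using the probable actual encoding, then re-serialize as utf-8 with a declaration that matches.

    def recover(data, declared, actual):
        # 'declared' is kept for logging, or for a policy decision
        # about whether to attempt recovery at all.
        text = data.decode(actual)
        # Replace any existing XML declaration with a correct one.
        end = text.find('?>')
        if text.startswith('<?xml') and end != -1:
            text = text[end + 2:]
        return ('<?xml version="1.0" encoding="utf-8"?>'
                + text).encode('utf-8')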

Disclaimer

Automated recovery from errors such as these is understandably controversial.

For this reason, it is best to determine in advance which conditions you are willing to attempt to recover from; for all others, you might consider providing a link to the feedvalidator.

In all cases, the choice to recover (or not) is something that an application needs to take responsibility for.
