Companion to Atom
Work in progress. By Sam Ruby
Unicode for Syndication consumers
XML is carefully designed to enable character encoding to be reliably detected and handled correctly. Unfortunately, not all content producers are as careful as they ought to be. Templates are filled with content from third-party sources that hasn’t been properly tagged, and the result is XML which is not well formed. Conformant implementations of XML parsers are required to reject such documents outright, placing the problem of remediation squarely in the application’s lap.
In all cases, the right first course of action is to use a proper XML parser. This will enable documents which are encoded properly to be processed correctly.
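As a minimal sketch of this first step, the following uses the expat-based parser in Python's standard library and reports failure instead of raising. The helper name try_parse is hypothetical, chosen only for illustration; any conformant parser would serve the same purpose.

```python
import xml.etree.ElementTree as ET

def try_parse(data: bytes):
    """Return (root, None) if data parses, or (None, error) if it does not."""
    try:
        return ET.fromstring(data), None
    except (ET.ParseError, ValueError, LookupError) as err:
        # ParseError: the document is not well formed (bad bytes included).
        # ValueError / LookupError: the declared encoding is unsupported or unknown.
        return None, err

# A correctly encoded feed parses; a stray iso-8859-1 byte is rejected.
good = b'<?xml version="1.0" encoding="utf-8"?><feed><title>caf\xc3\xa9</title></feed>'
bad = b'<feed><title>caf\xe9</title></feed>'
print(try_parse(good)[1])  # None
print(try_parse(bad)[1])   # a ParseError describing the invalid byte
```

Only when the second element of the returned pair is not None does the question of recovery arise.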
In the cases where this fails, the application may choose to report a failure, or it may choose to recover. Prior to any attempted recovery, there needs to be a determination of the declared encoding and the actual encoding.
Determining the declared (or implied) encoding
If the first byte of the feed is whitespace (hex 09, 0A, 0D, or 20), or if the first byte is an ASCII less-than sign (hex 3C) followed by an alphabetic character (hex 40 through 7A), then the explicit encoding declaration is missing, and the default encoding of utf-8 is to be used.
If the first four bytes of the feed are hex 3C 3F 78 6D (the ASCII characters "<?xm"), then there may be an explicit encoding. Such an encoding can be recognized as either the ASCII string encoding="name" or encoding='name', which will occur prior to a second question mark (hex 3F). If such a name is not present, the default encoding is again utf-8.
These two rules cover the vast majority of cases where the encoding is incorrect, but correctable. In all other cases, it probably is wise to assume that the problem is not correctable.
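The two rules can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name declared_encoding is hypothetical, whitespace is taken to be the four XML whitespace bytes (hex 09, 0A, 0D, 20), and the encoding name is matched only in the exact encoding="name" or encoding='name' form described above, with no spaces around the equals sign.

```python
import re

# Assumption: whitespace means the four XML whitespace bytes.
_XML_WHITESPACE = b"\x09\x0a\x0d\x20"

# encoding="name" or encoding='name'; the name follows the XML EncName shape.
_ENCODING = re.compile(rb"""encoding=(["'])([A-Za-z][A-Za-z0-9._-]*)\1""")

def declared_encoding(data: bytes):
    """Return the declared or implied encoding, or None if neither rule matches."""
    if not data:
        return None
    # Rule 1: leading whitespace, or '<' followed by a byte in hex 40..7A,
    # means there is no XML declaration, so utf-8 is implied.
    if data[0] in _XML_WHITESPACE:
        return "utf-8"
    if data[0] == 0x3C and len(data) > 1 and 0x40 <= data[1] <= 0x7A:
        return "utf-8"
    # Rule 2: the document starts with "<?xm"; look for an encoding name
    # before the second question mark.
    if data[:4] == b"<?xm":
        end = data.find(b"?", 2)
        head = data[2:end] if end != -1 else data[2:]
        match = _ENCODING.search(head)
        return match.group(2).decode("ascii").lower() if match else "utf-8"
    # Anything else (for example a byte order mark) is treated as not correctable.
    return None

print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><rss/>'))  # iso-8859-1
print(declared_encoding(b"<rss version='2.0'/>"))                                # utf-8
```

Returning None for every other case mirrors the advice to treat those documents as not correctable.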
Determining the actual encoding
If all bytes are hex 0A, 0D, or in the range hex 20 through 7E, then the encoding is most likely us-ascii.
If all bytes in the range hex C0 through FF are followed by bytes in the range hex 80 through BF, AND all bytes in the range hex 80 through BF are preceded by bytes in the range hex 80 through FF, then the encoding is most likely utf-8.
If all bytes are in a range acceptable to us-ascii OR in the range hex A0 through FF, then the encoding is most likely in the iso-8859-n family of encodings. Here you do have to do a little guessing. The iso-8859-1 encoding covers most of Western Europe, iso-8859-5 covers Cyrillic, iso-8859-7 covers Greek, etc. In the case of a missing or invalid encoding declaration, my experience is that iso-8859-1 is the most often correct interpretation.
If all bytes are in a range acceptable to the iso-8859 family of encodings or in the range hex 80 through 9F, AND you have both an indication that there is an invalid encoding (as evidenced by a parser error), and an indication that the document is generically in an ASCII based encoding (as evidenced by a match on one of the two rules specified in the "Determining the declared (or implied) encoding" section), then you may be seeing what is probably the most common encoding error found in syndication feeds: stray windows-1252 characters in an otherwise valid document.
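The four tests above can be applied in order, as in the following Python sketch. The function names are hypothetical, and several simplifications are assumed: the byte ranges are taken literally from the text (so a tab, hex 09, does not count as us-ascii), iso-8859-1 stands in for the whole iso-8859-n family, and the caller is assumed to have already confirmed the parser error and the declared-encoding match that the windows-1252 rule requires. The utf-8 test is the adjacency heuristic described here, which is looser than a strict utf-8 decode.

```python
def _is_ascii(byte: int) -> bool:
    # hex 0A, 0D, or the range hex 20 through 7E
    return byte in (0x0A, 0x0D) or 0x20 <= byte <= 0x7E

def looks_like_utf8(data: bytes) -> bool:
    for i, byte in enumerate(data):
        # Every byte in C0..FF must be followed by a byte in 80..BF.
        if 0xC0 <= byte <= 0xFF:
            if i + 1 >= len(data) or not 0x80 <= data[i + 1] <= 0xBF:
                return False
        # Every byte in 80..BF must be preceded by a byte in 80..FF.
        if 0x80 <= byte <= 0xBF:
            if i == 0 or not 0x80 <= data[i - 1] <= 0xFF:
                return False
    return True

def guess_actual_encoding(data: bytes):
    """Return the most likely encoding name, or None if no rule applies."""
    if all(_is_ascii(b) for b in data):
        return "us-ascii"
    if looks_like_utf8(data):
        return "utf-8"
    if all(_is_ascii(b) or 0xA0 <= b <= 0xFF for b in data):
        return "iso-8859-1"  # assumption: the most often correct family member
    if all(_is_ascii(b) or 0x80 <= b <= 0xFF for b in data):
        return "windows-1252"  # stray bytes in the range hex 80 through 9F
    return None

print(guess_actual_encoding("caf\u00e9".encode("iso-8859-1")))    # iso-8859-1
print(guess_actual_encoding("It\u2019s".encode("windows-1252")))  # windows-1252
```

The order matters: each later test accepts a superset of the bytes accepted by the us-ascii test, so the narrowest plausible answer is reported first.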
Recovery
Once you have determined the declared and (probable) actual encodings, you have a number of options for recovery. You can attempt to insert or correct the EncodingDecl to match the actual encoding, or you can convert the document to match the declaration. If you opt to go the latter route, here are some libraries that may help:
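With the codecs built into Python, for instance, converting a document to match its declaration can be sketched as a decode followed by an encode. The function name convert_to_declared is hypothetical, and the sketch assumes both encoding names are ones Python recognizes; it makes no attempt to rewrite the declaration itself.

```python
def convert_to_declared(data: bytes, actual: str, declared: str = "utf-8") -> bytes:
    """Re-encode data from its probable actual encoding to the declared one."""
    # Decoding raises UnicodeDecodeError if the guess about the actual
    # encoding was wrong; encoding raises UnicodeEncodeError if the declared
    # encoding cannot represent a character. Both are left for the caller.
    text = data.decode(actual)
    return text.encode(declared)

# A feed that implies utf-8 but contains a stray windows-1252 apostrophe (hex 92).
broken = b"<title>It\x92s here</title>"
fixed = convert_to_declared(broken, "windows-1252", "utf-8")
print(fixed)  # b'<title>It\xe2\x80\x99s here</title>'
```

Letting the codec errors propagate keeps a wrong guess visible rather than silently producing altered text.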
Disclaimer
Automated recovery from errors such as these is understandably controversial.
- It rewards bad behavior.
- While it makes surprises less frequent, it tends to make the surprises that do occur more severe.
- When surprises do occur, the root cause may be obscured by whatever recovery measures have been attempted.
For these reasons, it is best to determine which conditions you are willing to attempt to recover from, and for all others, you might consider providing a link to the feedvalidator.
In all cases, the choice to recover (or not) is something that an application needs to take responsibility for.