Re-syndicating vs sanitizing
Just over a month ago, Tim Bray pointed both to Jacques’ Atom Torture Test and to Planet Intertwingly. Regarding the latter, he noted with evident delight that NetNewsWire was able to tell him which entries he had already seen, due to the fact that Planet made an effort to retain atom:ids.
Until today, it didn’t occur to me that those two were related. Programs which can’t handle such things as MathML do a disservice by re-syndicating mangled or neutered content. This raises a number of interesting questions. I’m going to take a stab at answering them, but in all honesty, this is a subject for interesting debate.
- I have no problem with transformations which should be lossless. Making relative URIs absolute, or even rebasing them, should be OK, as should adding or removing ignorable whitespace, and even lowercasing element names. (A sketch of such a rewrite follows this list.)
- No matter how extensive the test suite, bugs are a fact of life. If a substantive difference creeps in unintentionally, it should be treated as a bug and fixed.
- Policy decisions are another matter. If styles or scripts are stripped, then either a new atom:id needs to be minted (see the second sketch below) or that particular entry should not be re-syndicated at all.
- Unsupported features, like MathML or inline SVG, are a special case of policy decisions, and probably should be treated likewise.
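To make the lossless bucket concrete, here is a minimal sketch of the relative-URI rewrite in Python, the language the Feed Parser and Planet are written in. The function name and example URLs are mine, for illustration only.

```python
from urllib.parse import urljoin

def absolutize(href, base):
    """Resolve a possibly-relative href against the entry's base URI.

    Lossless in the sense that the link still identifies the same
    resource; only its spelling changes.
    """
    return urljoin(base, href)

# A relative link in an entry served from example.org:
print(absolutize("images/diagram.png", "http://example.org/blog/entry.html"))
# -> http://example.org/blog/images/diagram.png
```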
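For the policy bucket, one way to mint a replacement atom:id is to derive it deterministically from the original id plus the sanitized content, so the same input always yields the same id and readers still de-duplicate correctly across runs. The tag: prefix and hashing scheme below are illustrative, not anything Planet actually does.

```python
import hashlib

def mint_id(original_id, sanitized_content):
    """Derive a stable replacement atom:id for an altered entry.

    Hashing the original id together with the sanitized content means
    re-running the aggregator over unchanged input yields the same id,
    while any further change to the content yields a new one.
    """
    digest = hashlib.sha1(
        (original_id + "\n" + sanitized_content).encode("utf-8")
    ).hexdigest()
    return "tag:planet.example.org,2007:sanitized/" + digest
```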
The first step is to modify the Feed Parser so that it returns, for each entry, a flag indicating whether that entry has been sanitized.
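To illustrate, here is how downstream aggregator code might consume such a flag, assuming it surfaced as a boolean attribute on each entry. The attribute name, feed URL, and skip-rather-than-mint policy are all hypothetical; the Feed Parser does not expose this today.

```python
import feedparser  # the Universal Feed Parser discussed above

d = feedparser.parse("http://planet.example.org/atom.xml")
for entry in d.entries:
    # 'sanitized' is the proposed per-entry flag; it does not exist in
    # current releases, so getattr() falls back to False against them.
    if getattr(entry, "sanitized", False):
        # Policy decision: withhold the altered entry (or mint a new
        # id, as sketched above) rather than re-syndicate it as-is.
        print("withholding altered entry:", entry.get("id"))
        continue
    print("re-syndicating:", entry.get("id"), entry.get("title"))
```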