Trackback in, valid out (mostly)
Jacques Distler: You gonna turn off Trackbacks (which don't declare a charset, and could be sent in any charset imaginable, but very frequently are Windows-1252)? Unless you have a way to guess the charset and re-encode the result to UTF-8, they will invalidate your pages as quick as you can sneeze.
It turns out that by design it is rather hard for a string of bytes to accidentally be valid utf-8, unless that string is pure US-ASCII, in which case it doesn't much matter which encoding you presume.
So, my current heuristics are as follows: if the data is valid utf-8, I accept it as such. If not, I assume windows-1252 and convert it to utf-8. This has failed me once, but my page is still valid.
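For the curious, here is a minimal sketch of that heuristic in Python. It is an illustration, not the code that runs this weblog: the function name `to_utf8` is hypothetical, and replacing the few bytes that windows-1252 leaves undefined with U+FFFD is an assumption on my part.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as utf-8, guessing the source charset."""
    try:
        # Strict decoding: any malformed sequence raises UnicodeDecodeError.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback guess: windows-1252. Python's cp1252 codec leaves a few
        # bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so errors="replace"
        # turns those into U+FFFD instead of raising (an assumption here).
        text = raw.decode("cp1252", errors="replace")
    return text.encode("utf-8")
```

Pure US-ASCII input comes out unchanged on the first branch, which is why the guess does not matter in that case.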
Note: neither windows-1252 nor iso-8859-1 guarantees well-formed XML 1.0. There is still a nasty character range issue to deal with.
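One reading of that character range issue: XML 1.0 only permits characters matching its `Char` production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF), so a stray control character survives decoding but still breaks well-formedness. A rough sketch of a filter, assuming that silently dropping the offending characters is acceptable; the name `strip_illegal_xml_chars` is hypothetical:

```python
import re

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_illegal_xml_chars(text: str) -> str:
    """Drop characters that may not appear in an XML 1.0 document."""
    return _ILLEGAL_XML_CHARS.sub("", text)
```

This would run on the decoded text, before it is escaped and written into the page.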