intertwingly

It’s just data

Safely consuming RSS: RegExps don't cut it


Simon Willison: Parsing simple HTML with regular expressions is unpleasant but possible, but attempting to securely filter potentially malicious HTML (while trying to keep the useful tags) can only lead to more problems. There are just too many possible combinations, thanks mainly to the huge flexibility provided by modern browsers. Attributes can be left unquoted, tags can be left unclosed, characters can be incorrectly escaped; it all adds up to far more variables than even the most comprehensive regexps can hope to match.

Personally, I'd like to see content producers meet content consumers halfway.
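To make the point in the quote concrete, here is a minimal sketch (not taken from Simon's post) that contrasts a typical regexp filter with a whitelist filter built on a real parser, Python's standard `html.parser`. The names `naive_regex_filter`, `WhitelistFilter`, `whitelist_filter`, and the contents of the three whitelists are assumptions chosen for illustration, and the sketch makes no attempt to balance unclosed tags.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for illustration only; a real filter needs reviewed lists.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p"}
ALLOWED_ATTRS = {"href", "title"}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
SKIP_CONTENT = {"script", "style"}  # drop these elements together with their text


def naive_regex_filter(html):
    """A typical regexp attempt: strip <script> blocks and quoted on* handlers."""
    html = re.sub(r"<script.*?>.*?</script>", "", html, flags=re.I | re.S)
    return re.sub(r'\son\w+="[^"]*"', "", html, flags=re.I)


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS or value is None:
                continue
            # Only keep links whose scheme is explicitly allowed.
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def whitelist_filter(html):
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


if __name__ == "__main__":
    dirty = '<p onclick=alert(1)>hi <a href="javascript:alert(1)">x</a> <b>bold</b>'
    print(naive_regex_filter(dirty))  # the unquoted onclick handler slips through
    print(whitelist_filter(dirty))    # only whitelisted tags and attributes remain
```

The difference is one of direction: the regexp tries to enumerate everything bad, while the parser-based filter discards anything it does not recognize as good, so unquoted attributes and odd casing stop mattering.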