I think Simon should share some of the blame for this. According to the HTML 4.01 spec:
authors should use “&gt;” (ASCII decimal 62) in text instead of “>” to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.
I don’t think he should. You’re quoting from a W3C Recommendation of 24 December 1999 - what ‘older user agents’ referred to there are still in use? Compatibility suggestions should not be an excuse for continued poor coding standards.
Compatibility suggestions should not be an excuse for continued poor coding standards.
Except I wasn’t excusing anyone. I think both producer and consumer share part of the blame. And this wasn’t a case of poor coding standards. It was merely an unwise attempt to compensate for a fairly common HTML error.
Is it an attempt to compensate though?
Though the failing parsers have the odd dodgy line like data = data.split('>', 1)[1] (how sure are we that Tidy always outputs the escaped form?) and the regexp pattern [^>]* (pop quiz, which of DTD declarations, the XML prologue, and processing instructions, may contain ‘>’ unescaped?) - seems all three are failing in this case for trying to do the right thing: relying on a dedicated parser rather than rolling their own.
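To make the failure concrete, here is a minimal sketch of what the split and the [^>]* pattern do when a quoted attribute value contains ‘>’. The link element and trailing text are made up for illustration, and the quote-aware pattern at the end is only an assumption about what a more careful regexp could look like, not what any of the parsers in question actually do.

```python
import re

# Hypothetical input: a valid tag whose title attribute contains an unescaped '>'.
tag = '<link rel="alternate" title="Posts > Comments" href="/feed.atom">'
rest = "trailing text"
data = tag + rest

# The split quoted above: cut at the first '>' and keep what follows.
# It stops inside the title attribute, so tag debris leaks into the "rest".
after_split = data.split('>', 1)[1]

# The [^>]* idea: '<', anything that is not '>', then '>'.
# It also ends the tag at the '>' inside the quoted value.
naive_tag = re.match(r'<[^>]*>', data).group(0)

# Illustrative alternative: skip over quoted strings before looking for '>'.
quote_aware = re.compile(r'''<(?:"[^"]*"|'[^']*'|[^>"'])*>''')
careful_tag = quote_aware.match(data).group(0)

print(repr(after_split))
print(repr(naive_tag))
print(repr(careful_tag))
```

The first two results are truncated at the ‘>’ inside the title, while the quote-aware match returns the whole tag. That still isn’t a parser, which is rather the point.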
However, it’s the python sgmllib, which er, doesn’t really inspire confidence:
# XXX This only supports those SGML features used by HTML.
...
def parse_starttag(self, i):
...
# XXX Can data contain &... (entity or char refs)?
# XXX Can data contain < or > (tag characters)?
# XXX Can there be whitespace before the first /?
Why’s it not been checked? Is there a canonical spec online somewhere? Is it even relevant, as HTML isn’t really (now? ever?) SGML? Does anyone use the lib to parse real SGML, and would it even work? Should html5lib be considered along with httplib2 for inclusion in the Python 3.1 standard library, as part of spring cleaning these crufty, out-of-date corners?
As I now seem like a total ass, I’d better point out I’ve written far worse code that string-wrangles HTML rather than treating it as markup. It’s still the easiest way to get stuff done, but that’s not a good thing.
How does Postel’s law even come into this? The quoted element is perfectly valid HTML. Surely being liberal in what you accept is on top of accepting valid content, such as the element?
@anonymous: Off the top of my head, DTD declarations and PI may contain ‘>’ unescaped. Also, the only way to get the SGML spec is to buy it from the ISO.
I would like to see html5lib in Python by default one day, but I think it’s better to wait for a version that simply has Python bindings to a C implementation.
On further examination, I have to agree with Geoffrey here. Simon’s current HTML is invalid for a number of pedantic reasons (mostly unescaped ampersands), but this unescaped bracket is not one of them. The real story here is that sgmllib sucks for parsing HTML, which is neither interesting nor news. At some point we should just cut over to using html5lib to parse all HTML and HTML-like content in feedparser (if we haven’t already). In the meantime, I am interested in what you think the problem really is, and what possible solutions you would offer that aren’t worse than the original problem.
Simon recently linked to a post of mine. His post wasn’t auto-excerpted and linked to from my original post as I was relying on sgmllib to perform autodiscovery.
Whether his web page is valid or deprecated or not, the autodiscovery link is picked up by Firefox and will be picked up by all HTML5 compliant browsers. I haven’t checked other current browsers, but as HTML5 tends to adopt all sane error recovery implemented by IE, I suspect that IE will pick up the link too.
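For anyone wanting to see autodiscovery survive a ‘>’ in an attribute value, here is a rough sketch using html.parser from the Python 3 standard library rather than sgmllib or html5lib. The class name, the helper, the set of feed types, and the sample page are all hypothetical; this is not the extractor code or the feed parser patch mentioned in this thread.

```python
from html.parser import HTMLParser

# Assumed set of media types to treat as feeds; adjust to taste.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> elements that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        # rel is a space-separated list of tokens; a missing attribute gives None.
        rels = (attrs.get("rel") or "").lower().split()
        media_type = (attrs.get("type") or "").lower()
        if "alternate" in rels and media_type in FEED_TYPES and attrs.get("href"):
            self.feeds.append(attrs["href"])


def find_feeds(html_text):
    finder = FeedLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.feeds


# Made-up page with an unescaped '>' inside the title attribute.
page = (
    '<html><head>'
    '<link rel="alternate" type="application/atom+xml" '
    'title="Entries > Comments" href="/atom/">'
    '</head><body></body></html>'
)
print(find_feeds(page))
```

The quoted title is read as one attribute value, so the href that follows it is still found.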
The patch I described was not recently engineered specifically for this purpose. I created it over a year ago. It was put into the feed parser around the same time. Adding that patch to my extractor code addressed my immediate problem.
In any case, you’ve already alluded to what I consider the right fix. I’d like to see the Universal Feed Parser V5 pre-req html5lib. And in a way that is more than a rip and replace. In Venus, I’m actually taking the output of the feed parser and passing it through html5. And converting the UserDicts back into a feed. And there are some scenarios concerning namespaced extensions that I can’t handle.
What would be better for me is for the feedparser to not merely use html5lib when convenient, but to be designed to build a tree for the entire feed. And for the default treebuilder for feed parsing purposes to be a new treebuilder based on UserDicts. Other callers could, instead, substitute a different treebuilder based on their needs.
Are there any good reasons not to escape ampersands and both opening and closing brackets in all SGML- and XML-like markup? Yes, Simon’s markup is perfectly valid and it’s the library consuming his markup which is to blame for coughing up a hairball over it, but still. Why insist on producing potentially invalid markup when you can avoid it? Postel’s law is bi-directional.
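On the producing side, escaping everything is cheap. A minimal sketch with html.escape from the Python 3 standard library, assuming the value is destined for a double-quoted attribute; the title string is invented for the example.

```python
from html import escape, unescape

# Hypothetical attribute value containing ampersands, a bracket and quotes.
title = 'Tom & Jerry > "Itchy & Scratchy"'

# quote=True also escapes double and single quotes, which matters inside attributes.
safe = escape(title, quote=True)
link = '<link rel="alternate" title="%s" href="/feed.atom">' % safe

print(link)

# A conforming consumer gets the original text back after unescaping.
assert unescape(safe) == title
```

The only ‘>’ left in the output is the one that really closes the tag, so even a parser that stops at the first ‘>’ gets the right answer.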