Scoping out a C++ HTML5 parser

2010-10-21T17:04:59Z

Henri Sivonen: the idea is to have a library that does HTML5 parsing and is API-compatible with libxml2

I do quite a bit of HTML-scraping (example: output). Lately my tool of choice is Nokogiri. It is based on libxml2, which does a good job on reasonably well-formed HTML. I decided to run it against the html5 tokenizer and tree-construction tests.

For the tokenizer tests, I walked the resulting tree to infer what the tokens might have been, and I didn’t worry about the reporting of ParseErrors. This means that automatic actions (such as the closing of elements) may show up as tokens even when no token in the input caused them. Additionally, while I normally parsed each string as a fragment, Nokogiri wouldn’t accept a DOCTYPE in a fragment, so I used the full HTML parser when a DOCTYPE was present. Long story short: while there were plenty of differences, not all of them are significant.
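
The tree walk described above can be sketched roughly as follows. This is only an illustration, not the code used for these test runs: it is written in Python, uses the standard library's xml.etree.ElementTree as a stand-in for a Nokogiri tree, and `infer_tokens` is a hypothetical helper name. The token shapes loosely follow the output format of the tokenizer tests.

```python
import xml.etree.ElementTree as ET

def infer_tokens(element):
    """Walk a parsed tree depth-first and guess the tokens behind it."""
    tokens = [["StartTag", element.tag, dict(element.attrib)]]
    if element.text:
        # Text directly inside the element, before its first child.
        tokens.append(["Character", element.text])
    for child in element:
        tokens.extend(infer_tokens(child))
        if child.tail:
            # Text that follows the child's end tag inside the parent.
            tokens.append(["Character", child.tail])
    # Always emitted, whether or not the input had an explicit end tag.
    tokens.append(["EndTag", element.tag])
    return tokens

# Well-formed stand-in input (assumption); a real run would walk the
# tree produced by the parser under test instead.
root = ET.fromstring('<p class="x">one<b>two</b>three</p>')
tokens = infer_tokens(root)
for token in tokens:
    print(token)
```

Because the walk only sees the finished tree, it emits an EndTag for every element, including ones the parser closed on its own. That is why some reported differences do not correspond to any token in the input.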

Similar story for the tree tests: I didn’t worry about the DOCTYPE differences, and I didn’t worry about missing empty head elements (Nokogiri only adds this element when necessary, and counting its absence would have resulted in a lot more failures). Therefore, in this case the differences don’t quite tell the whole story.

My conclusion is that libxml2’s HTML parser is far from HTML5 compliant, not that it ever claimed to be. I’m just verifying that there is indeed a hole that the library Henri is starting to scope out would fill.