It’s just data

Scoping out a C++ HTML5 parser

Henri Sivonen: the idea is to have a library that does HTML5 parsing and is API-compatible with libxml2

I do quite a bit of HTML-scraping (example: output).  Lately my tool of choice is Nokogiri.  It is based on libxml2, which does a good job on reasonably well-formed HTML.  I decided to run it against the html5 tokenizer and tree-construction tests.

For the tokenizer tests, I walked the resulting tree to infer what the tokens might have been, and I didn’t worry about reporting of ParseErrors.  This means that automatic actions (such as the closing of elements) may be reported even when no token caused them.  Additionally, while I normally parsed each string as a fragment, Nokogiri wouldn’t accept a DOCTYPE in a fragment, so I used the full HTML parser whenever a DOCTYPE was present.  Long story short: while there were plenty of differences, not all of them are significant.
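As a rough illustration of that tree walk (not the script used for these results), here is a minimal Python sketch.  It assumes the standard library's ElementTree as a stand-in for the Nokogiri tree, a well-formed snippet as input, and a token shape loosely modeled on the lists used in the tokenizer tests; the function name `infer_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET


def infer_tokens(element):
    """Walk a parsed tree depth-first and return the tokens it implies.

    Assumption: tokens are lists such as ["StartTag", name, attrs],
    ["Character", data] and ["EndTag", name].
    """
    tokens = [["StartTag", element.tag, dict(element.attrib)]]
    if element.text:
        # Text that appears before the first child element.
        tokens.append(["Character", element.text])
    for child in element:
        tokens.extend(infer_tokens(child))
        if child.tail:
            # Text that follows the child's closing tag.
            tokens.append(["Character", child.tail])
    # Emitted for every element, whether or not the input had an end tag.
    tokens.append(["EndTag", element.tag])
    return tokens


# Stand-in for a parsed fragment; ElementTree needs well-formed markup.
tree = ET.fromstring('<p class="x">Hello <b>world</b>!</p>')
tokens = infer_tokens(tree)
for token in tokens:
    print(token)
```

Because the walk only sees the finished tree, it emits an EndTag for every element, which is exactly how an automatically closed element ends up reported as if a token had closed it.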

Similar story for the tree tests: I didn’t worry about the DOCTYPE differences, and I didn’t worry about missing empty head elements (Nokogiri only adds this element when necessary; counting its absence would have resulted in a lot more failures).  Therefore, in this case the differences don’t quite tell the whole story.

My conclusion is that libxml2’s HTML parser is far from HTML5 compliant (not that it ever claimed to be).  I’m just verifying that there is indeed a hole that the library Henri is starting to scope out would fill.


If libxml2 is that widespread, can’t it be modified? Do the old HTML parsing semantics have to be preserved? Were they even specified?

Posted by Astro at

I’ve attempted to send a post to the [mailing list] with this question, but it hasn’t shown up.  I’ve followed up with a request to the list owner... hopefully it will appear shortly.

Posted by Sam Ruby at

As someone who attempted to keep an implementation of an HTML5 parser up-to-date for some period of time last year, I will say it’s time-consuming, thankless work, especially since the spec was changing a lot. Now that things have settled down a bit, it might be good to start making full implementations.

Posted by Edward Z. Yang at

posted

Posted by Sam Ruby at

