It’s just data

Scoping out a C++ HTML5 parser

Henri Sivonen: the idea is to have a library that does HTML5 parsing and is API-compatible with libxml2

I do quite a bit of HTML-scraping (example: output).  Lately my tool of choice is Nokogiri.  It is based on libxml2, which does a good job on reasonably well-formed HTML.  I decided to run it against the html5 tokenizer and tree-construction tests.

For the tokenizer tests, I walked the resulting tree to infer what the tokens might have been, and I didn’t worry about reporting of ParseErrors.  This means that automatic actions (such as the closing of elements) may be reported as tokens even when no token in the input caused them.  Additionally, while I normally parsed each string as a fragment, Nokogiri wouldn’t accept a DOCTYPE in a fragment, so I used the full HTML parser when a DOCTYPE was present.  Long story short: while there were plenty of differences, not all of them are significant.
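To make the tree walk concrete, here is a minimal sketch of the idea in Python. It is not the code I ran: the `Element`, `Text`, and `Comment` classes are hypothetical stand-ins for whatever node types the parser returns, and the token shapes (`StartTag`, `Character`, `Comment`, `EndTag`) are assumed to follow the html5 tokenizer test format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """Hypothetical stand-in for a parsed element node."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


@dataclass
class Text:
    """Hypothetical stand-in for a text node."""
    data: str


@dataclass
class Comment:
    """Hypothetical stand-in for a comment node."""
    data: str


Node = Union[Element, Text, Comment]


def infer_tokens(nodes):
    """Walk a list of sibling nodes and guess the tokens that produced them."""
    tokens = []
    for node in nodes:
        if isinstance(node, Text):
            # Adjacent text nodes are merged into one Character token.
            if tokens and tokens[-1][0] == "Character":
                tokens[-1][1] += node.data
            else:
                tokens.append(["Character", node.data])
        elif isinstance(node, Comment):
            tokens.append(["Comment", node.data])
        else:
            tokens.append(["StartTag", node.name, dict(node.attrs)])
            tokens.extend(infer_tokens(node.children))
            # Emitted even if the input never contained a closing tag.
            tokens.append(["EndTag", node.name])
    return tokens


# A tree such as a parser might build for the input "<p>one<b>two".
tree = [Element("p", children=[Text("one"), Element("b", children=[Text("two")])])]
for token in infer_tokens(tree):
    print(token)
```

The walk emits an `EndTag` for both `b` and `p` even though the input contained neither, which is exactly the kind of difference that shows up in the results without being significant.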

Similar story for the tree tests: I didn’t worry about the DOCTYPE differences, and I didn’t worry about missing empty head elements (Nokogiri only adds this element when necessary, which would otherwise have resulted in a lot more failures).  Therefore, in this case the differences don’t quite tell the whole story.
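One way to apply that kind of leniency is to normalize both the expected and the actual tree dumps before comparing them. The sketch below is an illustration rather than what I actually ran: `normalize_dump` is a hypothetical helper, and it assumes the dumps use the `| `-prefixed layout of the tree-construction tests, with two spaces of indentation per level.

```python
from typing import List


def normalize_dump(dump: str) -> List[str]:
    """Drop DOCTYPE lines and empty <head> elements from a tree dump."""
    lines = dump.splitlines()
    kept = []
    for i, line in enumerate(lines):
        # Ignore DOCTYPE differences entirely.
        if line.startswith("| <!DOCTYPE"):
            continue
        # A <head> directly under <html> is empty when the next line
        # is not indented one level deeper.
        if line == "|   <head>":
            following = lines[i + 1] if i + 1 < len(lines) else ""
            if not following.startswith("|     "):
                continue
        kept.append(line)
    return kept


expected = "| <!DOCTYPE html>\n| <html>\n|   <head>\n|   <body>\n|     <p>"
actual = "| <html>\n|   <body>\n|     <p>"
print(normalize_dump(expected) == normalize_dump(actual))
```

A head element that has children is left alone, so a missing `title` would still count as a difference.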

My conclusion is that libxml2’s HTML parser is far from HTML5 compliant, not that it ever claimed to be.  I’m just verifying that there is indeed a gap that the library Henri is starting to scope out would fill.


libxml2 is that widespread, can’t it be modified? Do the old HTML parsing semantics have to be preserved? Were they even specified?

Posted by Astro at

I’ve attempted to send a post to the [mailing list] with this question, but it hasn’t shown up.  I’ve followed up with a request to the list owner... hopefully it will appear shortly.

Posted by Sam Ruby at

As someone who attempted to keep an implementation of an HTML5 parser up-to-date for some period of time last year, I will say it’s time-consuming, thankless work, especially since the spec was changing a lot. Now that things have settled down a bit, it might be good to start making full implementations.

Posted by Edward Z. Yang at

posted

Posted by Sam Ruby at
