It’s just data

Ruby HTML5 Tokenizer

Henri Sivonen: I expected that it would make sense to use RELAX NG for expressing virtually all HTML5 conformance requirements that could theoretically be expressed in RELAX NG. This expectation turned out to be incorrect.

Perhaps a DSL would be appropriate?

Background: the FeedValidator pumps a set of SAX events into a data structure that also happens to be executable Python code. Given what I now know, I believe I can do much better, particularly if I pick a language that excels at producing usable DSLs.

So, the first step is to port the html5lib tokenizer to Ruby. This code makes use of the chardet, iconv, and json modules, if available. It passes the tokenizer tests, bar two:

Loaded suite runtests
Finished in 0.202438 seconds.

  1) Failure:
    [./tests/test_tokenizer.rb:152:in `test_50'
     ./tests/test_tokenizer.rb:135:in `test_50']:

        Numeric entity representing a codepoint after 1114111 (U+10FFFF)


    Content Model Flag:
<["ParseError", ["Character", "\357\277\275"]]> expected but was
<[["Character", "\370\210\237\221\206"]]>.

  2) Failure:
    [./tests/test_tokenizer.rb:152:in `test_51'
     ./tests/test_tokenizer.rb:135:in `test_51']:

        Hexadecimal entity representing a codepoint after 1114111 (U+10FFFF)


    Content Model Flag:
<["ParseError", ["Character", "\357\277\275"]]> expected but was
<[["Character", "\374\220\204\217\277\277"]]>.

67 tests, 67 assertions, 2 failures, 0 errors

At the moment, I’m inclined to believe that the Ruby implementation is correct in these two cases, and that the test suite is checking for incorrect behavior that fundamentally derives from a current Python limitation.
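For readers puzzling over the octal escapes in the output above, here is a small sketch (not part of tokenizer.rb) that rebuilds the two byte strings from the first failure. The expected bytes are the UTF-8 encoding of U+FFFD, the replacement character; the bytes actually produced use a five-byte form that is not valid UTF-8. The variable names are mine, for illustration only.

```ruby
# Bytes copied from the first failure, with the octal escapes written as hex.
# "\357\277\275"         => EF BF BD
# "\370\210\237\221\206" => F8 88 9F 91 86
expected = [0xEF, 0xBF, 0xBD].pack("C*").force_encoding("UTF-8")
actual   = [0xF8, 0x88, 0x9F, 0x91, 0x86].pack("C*").force_encoding("UTF-8")

# The expected string is one well-formed character: U+FFFD.
expected_codepoints = expected.codepoints
expected_is_valid   = expected.valid_encoding?

# The string the tokenizer produced starts with a five-byte lead byte (0xF8),
# which UTF-8 as currently defined does not allow.
actual_is_valid = actual.valid_encoding?

puts format("expected: U+%04X, valid UTF-8: %s", expected_codepoints.first, expected_is_valid)
puts "actual valid UTF-8: #{actual_is_valid}"
```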

Update: it turns out that sys.maxunicode is supposed to be 1114111 (U+10FFFF). I’ve added a check in tokenizer.rb, and now all the tests pass. As an aside, unichr(65536) works on my machine (Python version 2.5.1c1).
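The check amounts to something like the following. This is a minimal sketch rather than the actual code in tokenizer.rb: the method name `numeric_entity_tokens` and both constants are hypothetical, and the token shapes are simply copied from the test output above. It only covers the upper bound; other ranges a real tokenizer has to reject (surrogates, for instance) are left out.

```ruby
# Hypothetical names; the real tokenizer.rb may be organized quite differently.
MAX_CODEPOINT = 0x10FFFF                  # 1114111, the last Unicode code point
REPLACEMENT   = [0xFFFD].pack("U")        # U+FFFD REPLACEMENT CHARACTER

# Turn the number from a numeric entity (&#NNN; or &#xHHH;) into tokens,
# using the same token shapes that appear in the test output.
def numeric_entity_tokens(codepoint)
  if codepoint > MAX_CODEPOINT
    # Out of range: report a parse error and substitute U+FFFD.
    ["ParseError", ["Character", REPLACEMENT]]
  else
    [["Character", [codepoint].pack("U")]]
  end
end

p numeric_entity_tokens(65)
p numeric_entity_tokens(0x110000)
```

Without the range check, `pack("U")` happily encodes numbers past U+10FFFF using the longer legacy byte forms, which is where the five- and six-byte strings in the failures came from.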

Nice. I’ve actually been working on a Ruby HTML5 parser, but don’t have anything public to show yet (maybe I should get it out anyway). I’ve already squatted on the RubyForge project.

Posted by ryan king at


Definitely looking forward to a working Ruby implementation of HTML5.

Posted by Bob Aman at

RELAX NG and Schematron are languages for specific domains even if they are not domain-specific programming languages. The cases that they do not cover are particular and varied enough that using a general-purpose language seems more reasonable to me than writing a DSL that can cope with the remaining cases.

The crux of the quote in the post is that exclusions are better done in Schematron than in RELAX NG.

Posted by Henri Sivonen at

Ruby the Same Token

Ruby HTML5 tokenizer, written, fittingly, by Sam Ruby, porting the Python HTML5 html5lib tokenizer. As a result of rigorous specification and careful thought, the HTML5 tokenizer is pretty easy to write AND works on almost all HTML and XHTML...

Excerpt from Waffle at
