intertwingly

It’s just data

Ruby HTML5 Tokenizer


Henri Sivonen: I expected that it would make sense to use RELAX NG for expressing virtually all HTML5 conformance requirements that could theoretically be expressed in RELAX NG. This expectation turned out to be incorrect.

Perhaps a DSL would be appropriate?

Background: the FeedValidator pumps a set of SAX events into a data structure that also happens to be executable Python code.  Given what I now know, I believe I can do much better, particularly if I pick a language that excels at producing usable DSLs.

So, the first step is to port the HTML5lib tokenizer to Ruby.  This code makes use of the chardet, iconv, and json modules, if available.  It passes all but two of the tokenizer tests:

Loaded suite runtests
Started
.............................................FF....................
Finished in 0.202438 seconds.

  1) Failure:
test_50(Html5TokenizerTestCase)
    [./tests/test_tokenizer.rb:152:in `test_50'
     ./tests/test_tokenizer.rb:135:in `test_50']:

    Description:
        Numeric entity representing a codepoint after 1114111 (U+10FFFF)

    Input:
        &#2225222;

    Content Model Flag:
        PCDATA
.
<["ParseError", ["Character", "\357\277\275"]]> expected but was
<[["Character", "\370\210\237\221\206"]]>.

  2) Failure:
test_51(Html5TokenizerTestCase)
    [./tests/test_tokenizer.rb:152:in `test_51'
     ./tests/test_tokenizer.rb:135:in `test_51']:

    Description:
        Hexadecimal entity representing a codepoint after 1114111 (U+10FFFF)

    Input:
        &#x1010FFFF;

    Content Model Flag:
        PCDATA
.
<["ParseError", ["Character", "\357\277\275"]]> expected but was
<[["Character", "\374\220\204\217\277\277"]]>.

67 tests, 67 assertions, 2 failures, 0 errors

At the moment, I’m inclined to believe that the Ruby implementation is correct in these two cases, and that the test suite is checking for incorrect behavior that fundamentally derives from a current Python limitation.
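The octal escapes in the two failures can be decoded by hand to see which code points were actually emitted. What follows is a minimal sketch, not code from the port: `decode_extended_utf8` is a hypothetical helper that assumes the original UTF-8 scheme, which allowed sequences of up to six bytes, and it does no validation of continuation bytes.

```ruby
# Hypothetical helper for illustration; not part of tokenizer.rb.
# Decodes a single character from the original (pre-RFC 3629) UTF-8
# scheme, which permitted sequences of up to six bytes.
def decode_extended_utf8(bytes)
  lead = bytes.first

  # The number of leading 1 bits in the first byte is the sequence length.
  length = 0
  length += 1 while length < 8 && lead[7 - length] == 1
  return lead if length == 0 # a plain ASCII byte stands for itself

  # Keep the payload bits of the lead byte, then append six bits
  # from each continuation byte (no validation is attempted here).
  codepoint = lead & (0x7F >> length)
  bytes[1, length - 1].each { |b| codepoint = (codepoint << 6) | (b & 0x3F) }
  codepoint
end

# Bytes copied from the "expected but was" lines of the two failures.
test_50_actual = [0370, 0210, 0237, 0221, 0206]
test_51_actual = [0374, 0220, 0204, 0217, 0277, 0277]
expected       = [0357, 0277, 0275]

puts decode_extended_utf8(test_50_actual)               # decimal code point
puts decode_extended_utf8(test_51_actual).to_s(16)      # hexadecimal code point
puts decode_extended_utf8(expected).to_s(16)            # what the tests wanted
```

Under those assumptions, the bytes in each failure decode back to the out-of-range code point named by the entity, while the expected bytes decode to U+FFFD, the replacement character.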

Update: it turns out that sys.maxunicode is supposed to be 1114111 (U+10FFFF). I’ve added a check in tokenizer.rb, and now all the tests pass.  As an aside, unichr(65536) works on my machine (Python version 2.5.1c1).
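A range check of this kind might look roughly like the sketch below. It is not the code in tokenizer.rb: the method name `numeric_entity_tokens` and the constants are assumptions made for illustration, and the token shapes are copied from the test output above. Other cases a real tokenizer has to handle, such as surrogates, are left out.

```ruby
# Hypothetical sketch; the names here are not taken from tokenizer.rb.
MAX_CODEPOINT = 0x10FFFF               # 1114111, the highest Unicode code point
REPLACEMENT   = [0xFFFD].pack('U')     # U+FFFD, "\357\277\275" as UTF-8 bytes

# Returns the token list for a numeric entity, in the same shape the
# tokenizer tests print: a ParseError marker followed by the replacement
# character when the code point is out of range, otherwise one Character token.
def numeric_entity_tokens(codepoint)
  if codepoint > MAX_CODEPOINT
    ['ParseError', ['Character', REPLACEMENT]]
  else
    [['Character', [codepoint].pack('U')]]
  end
end

p numeric_entity_tokens(2225222)       # input of test_50
p numeric_entity_tokens(0x1010FFFF)    # input of test_51
p numeric_entity_tokens(0x41)          # an in-range code point
```

With the check in place, both failing inputs yield the token list the tests expect, and in-range code points are still encoded as before.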