It’s just data

HTML5 in Gecko

Henri Sivonen: The effort of putting an HTML5 parser inside Gecko takes a step out of the vaporware land.

I can confirm that it displays this page, served as text/html as well as the same bytes served as application/xhtml+xml.  For comparison, Chrome does nearly as well, simply omitting the SVG images.  Opera doesn’t fare as well, apparently not recognizing the self-closing SVG tags.  IE 8’s support is simply sad, well beyond the lack of processing of SVG images.

I’ll also note that the HTML5 validator doesn’t yet accept SVG in content served as text/html.  Hopefully the SVG and WHATWG/HTML5 working groups can resolve their differences in 2009 and this can be fixed.

Henri’s approach is interesting.  He starts from a single source, in Java.  The Java code can be compiled to Java byte codes, JavaScript source, or C++ presumably making use of Mozilla libraries for things such as memory management.  If he can do that, it seems to me to be a rather small leap from there to producing C++ using, say, either Ruby or Python libraries for memory management, as well as a thin binding to the language.  C# would also be a reasonable target.

If this could be done, and made available under a liberal license, it could go a long way towards making available consistent and performant implementations of the HTML5 parser algorithm everywhere.


submitted by gthank [link] [0 comments]...

Excerpt from programming: what's new online at

FWIW, what Opera does is conforming per HTML5 as it stands today. Once we have SVG in text/html support it will “work” of course, but now we just parse that as if they were unknown elements.

Posted by Anne van Kesteren at

Anne, the HTML5 draft states the following:

Then, if the element is one of the void elements, or if the element is a foreign element, then there may be a single U+002F SOLIDUS (/) character. This character has no effect on void elements, but on foreign elements it marks the start tag as self-closing.

Does Opera implement this part of the algorithm?  My recollection is that if I were to convert my markup to not assume that SVG elements with a trailing U+002F SOLIDUS character are treated as self-closing tags that Opera would render this page as Chrome does.

Alternately, are you asserting that Chrome/WebKit renders this page incorrectly today?

Posted by Sam Ruby at

Sam,

You’re quoting the Writing HTML documents section, which is just describing/defining the HTML syntax.

If you go look into the Parsing HTML documents section, you’ll find out that the only thing that triggers in foreign content tree builder insertion mode is a math element (as SVG has been –temporarily– removed).

So let’s take a closer look:
Your svg start tag (from the Moonlight image) is treated as any other start tag (in the in body insertion mode AFAICT). Same with the circle tag, which is considered a phrasing element. The next start tag ("g") after the “circle” self-closing tag will then create an element as a child of the circle element. Same with path (as a child of g). Then when the g end tag is processed (as any other end tag in the in body insertion mode), it loops once (node=current node=path, which is a phrasing element) and on the second iteration (node=g) it first generates implied end tags (this algorithm won’t do anything (current node=path)) and then pops the path and the g elements. Same goes when the svg end tag is processed (popping circle and svg).
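
To make the walk-through above concrete, here is a minimal sketch in Python of the stack of open elements. It is a deliberate simplification, not the spec algorithm: it ignores the self-closing flag on unknown elements and treats every end tag as "any other end tag". The token list (svg, circle, g, path, then a following p) is an assumption based on the description above, and build_tree and outline are hypothetical helper names.

```python
def build_tree(tokens):
    """Simplified 'in body' handling of unknown elements (illustrative only)."""
    root = {"name": "#root", "children": []}
    stack = [root]  # stack of open elements; the last item is the current node
    for kind, name in tokens:
        if kind == "start":
            # The self-closing flag is ignored, so the element stays open.
            node = {"name": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "end":
            # Walk down from the current node until a matching element is
            # found, then pop everything up to and including it.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["name"] == name:
                    del stack[i:]
                    break
    return root


def outline(node, depth=0):
    """Return the tree as a list of indented element names."""
    lines = []
    for child in node["children"]:
        lines.append("  " * depth + child["name"])
        lines.extend(outline(child, depth + 1))
    return lines


# Assumed token stream: <svg><circle/><g><path/></g></svg><p>
tokens = [
    ("start", "svg"), ("start", "circle"), ("start", "g"),
    ("start", "path"), ("end", "g"), ("end", "svg"), ("start", "p"),
]
tree = build_tree(tokens)
print("\n".join(outline(tree)))
```

In this sketch g ends up nested inside circle and path inside g, while the p that follows the svg end tag becomes a sibling of svg.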

This means that the rest of the article should be a sibling of the svg element per spec; …and that Opera is wrong.

But I’m talking here about Opera 9.62; maybe Anne is talking about the latest snapshot (yet to be released)?

Posted by Thomas Broyer at

The idea is to refcount objects provided by the host environment. These are: local names (nsIAtom*), strings (nsString*) and element nodes (nsIContent*). (Well, nsString* is a bit special but that’s not relevant here.) It should be easy to refcount e.g. PyObjects instead. I’ve deliberately made the types pluggable in the C++ translator to make it easier to use non-Gecko types. (Indeed, the specific case I have had in mind has been Python.)

The allocation side may be more problematic. I’m currently assuming the availability of a no-fail allocator, so I haven’t prepared traditional code paths for checking each allocation for failure. I’m not sure how big of an issue this assumption would be when writing a C++ module for Python or Ruby.

Then there’s the issue of which UTF to use. Java, JavaScript and Gecko all use UTF-16 in their DOM (or DOM-like) APIs, so the parser now assumes that the document tree wants UTF-16. Python has the problem of not having a stable Unicode string type: it can be UTF-16 or UTF-32 depending on how the interpreter was compiled. That’s really, really bad. (I wish Python settled on UTF-16 for compatibility with Jython, IronPython and PyObj-C, but that’s not the direction Python seems to be taking.) Anyway, to make the parser work on UTF-32 instead, alternative code paths would be needed for numeric character reference decoding and the table of named characters would need to have different data. Not a big deal.
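
As a rough illustration of why numeric character reference decoding needs a separate code path per UTF, the Python sketch below converts one code point to UTF-16 code units and to UTF-32 code units. The function names are hypothetical and this is not taken from the parser; it also skips the validity checks (surrogates, out-of-range values) that a real decoder would need.

```python
def code_point_to_utf16(code_point):
    """Return the UTF-16 code units for one Unicode scalar value."""
    if code_point < 0x10000:
        return [code_point]  # fits in a single 16-bit unit
    offset = code_point - 0x10000
    # Split the 20-bit offset into a high and a low surrogate.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def code_point_to_utf32(code_point):
    """Return the UTF-32 code units: always exactly one unit."""
    return [code_point]


# "&#x1D538;" needs two UTF-16 code units but only one UTF-32 unit.
print([hex(u) for u in code_point_to_utf16(0x1D538)])
print([hex(u) for u in code_point_to_utf32(0x1D538)])
```

The table of named characters would differ in the same way: any entry above U+FFFF is stored as a surrogate pair for UTF-16 and as a single value for UTF-32.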

With UTF-8, there is the additional issue that U+FFFD and U+0000 have different lengths in UTF-8, so the tokenizer would need to defer to the I/O driver for the replacement operation. The tokenizer already does this for LFs following CR, so the status flag would need to be tri-state instead of a boolean.
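
The length difference is easy to check. This small Python sketch (encoded_lengths is a hypothetical helper) compares the encoded sizes of U+0000 and U+FFFD: in UTF-16 and UTF-32 the two are the same size, so one can overwrite the other in place, whereas in UTF-8 the replacement is longer than the character it replaces.

```python
NUL = "\u0000"
REPLACEMENT = "\ufffd"


def encoded_lengths(encoding):
    """Return (bytes for U+0000, bytes for U+FFFD) in the given encoding."""
    return len(NUL.encode(encoding)), len(REPLACEMENT.encode(encoding))


for encoding in ("utf-16-le", "utf-32-le", "utf-8"):
    print(encoding, encoded_lengths(encoding))
```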

As for C#, I would expect the code to compile on IKVM.NET already, but I have not tried it. The main effort would be writing a .NET IO driver (and a tree builder subclass) to make the parser feel like a .NET library. Of course, it would also be feasible and relatively easy to mechanically translate the source of the parser core into C# to avoid a dependency on IKVM.NET. Supporting C# would be off-focus from the point of view of Validator.nu or Gecko development, so I haven’t explored it properly.

The validator doesn’t accept SVG and MathML in text/html yet, because I want to discourage people from creating content ahead of corresponding browser features, because the content created might hinder the introduction of those browser features if the proactive legacy turned out to be subtly incompatible. (I should probably add at least warnings for section, article, details, etc.)

Posted by Henri Sivonen at

My apologies, I didn’t actually verify what was happening. Does indeed seem like something we need to fix.

Posted by Anne van Kesteren at

Henri:

It sounds like Python’s unicode type might be a better fit than Python’s string type?  Indeed that’s an area that will significantly change in the upcoming Python 3.  Similarly, Ruby 1.9/2 improves unicode handling.

no-fail allocation could be implemented with longjmp.

Re: off-focus, this could be an opportunity for somebody else to scratch this itch.

While I endorse the idea of producing warnings for things that are not widely implemented yet, in the context of a spec that isn’t complete and browsers who are rapidly adding features (example: section is just fine in Firefox 3, just not Firefox 2), this will be a very hard line to draw.  Especially once you factor IE into the equation — even if IE were to implement HTML5 today (not likely), it will be several years before that version is widely deployed.

You might also consider collapsing repeated mention of the same issue, particularly for warnings.

Anne:

It looks like I misdiagnosed the issue in any case.  Self-closing tags are not the issue; the fact that the closing </svg> tag doesn’t close out any nested open tags is.  In any case, given that the issue is now known, I’m confident that it will be addressed.

Posted by Sam Ruby at

It sounds like Python’s unicode type might be a better fit than Python’s string type?

The problem is that Python 2.x doesn’t have a stable Unicode type. It’s bad that the meaning of Python programs and the C interfaces change depending on how the Python interpreter was compiled?

Indeed that’s an area that will significantly change in the upcoming Python 3.

Does Python 3 have a Unicode type that behaves the same way on Mac OS X and Debian?

Similarly, Ruby 1.9/2 improves unicode handling.

Is there a Unicode string type that is locked to one of UTF-8/16/32?

no-fail allocation could be implemented with longjmp.

Thanks. I need to take a more careful look at longjmp. (I’ve never used it.)

Re: off-focus, this could be an opportunity for somebody else to scratch this itch.

Sure.

section is just fine in Firefox 3, just not Firefox 2)

section parses acceptably in Firefox 3.0.x, but Firefox 3.0.x doesn’t have a selector for styling based on the section-induced outline depth and it doesn’t do anything special with section for accessibility API exposure. So while section doesn’t break rendering in Firefox 3, I wouldn’t say it’s supported, either.

Posted by Henri Sivonen at

HTML 5 Gecko Build

Henri Sivonen has posted an experimental Gecko build that parses HTML 5: The level of quality is “It runs and some pages render!” This build is not at all suitable for normal browsing use. Please don’t use it with your usual Firefox profile. There...

Excerpt from Ajaxian » Front Page at

The problem is that Python 2.x doesn’t have a stable Unicode type. It’s bad that the meaning of Python programs and the C interfaces change depending on how the Python interpreter was compiled?

Sounds like a job for #ifdef.

Does Python 3 have a Unicode type that behaves the same way on Mac OS X and Debian?

I misinterpreted what you were saying.  Oversimplifying a bit, what was the Python 2 unicode type is now the Python 3 string type, with the same compile-time switch that is causing you grief.
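
For what it’s worth, the build type can be inspected at run time through sys.maxunicode, which is 0xFFFF on a narrow (UTF-16) build and 0x10FFFF on a wide build. A minimal sketch follows; describe_build is a hypothetical helper, and on Python 3.3 and later the value is always 0x10FFFF, so there it will always report a wide build.

```python
import sys


def describe_build(max_code_point=sys.maxunicode):
    """Classify an interpreter build by the largest code point it reports."""
    if max_code_point == 0xFFFF:
        return "narrow (UTF-16 code units)"
    return "wide (full code points)"


print(describe_build())
```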

Is there a Unicode string type that is locked to one of UTF-8/16/32?

In Ruby 1.9/2, the encoding of a string is not decided at compile time, but is an attribute that you can set at run-time.  So you could pick UTF-16LE, for example and go with it.  What’s even better is that the methods such as length and array indexing operate on characters instead of bytes, so you could pick UTF-8 and not have to worry about the fact that different characters may require different lengths to encode.

Posted by Sam Ruby at

As you keep talking about a C# HTML5 parser, as a reminder, I used to work on such a thing some time ago.
Nobody ever jumped in to help or even show interest.
Now that the parsing section of HTML5 seems to be more stable (and has been for a few months, actually), I think I should really get back and revive it.

But Twintsam is definitely not an abandoned project.

Posted by Thomas Broyer at

But Twintsam is definitely not an abandoned project.

Just to be clear, neither is html5lib; but a Python backend to htmlparser would be sweet.  Either a pure python one, or (and even better) a C/C++ library making use of Python internals for strings and memory management and the like and with a Python binding.

Ditto for Ruby, PHP, Perl, etc.  This is something I could possibly help with.

Posted by Sam Ruby at

Looks fine in Opera 10 alpha.

Posted by porneL at

Where fine means “no SVG” and “no border radius”.  Of course, that’s entirely OK, they don’t advertise that these two functions work.

I will say that it was a breeze to install on Ubuntu, once I figured out what version of gcc I have (4) and what version of qt I have (also 4).

Posted by Sam Ruby at

HTML5 parser experimentally in Mozilla, too

I have already mentioned several HTML5 parsers (see the posts tagged parser). Henri Sivonen has created an experimental build of Mozilla that uses just such an HTML5 parser. It is an interesting demonstration. At a time when no browser includes HTML5...

Excerpt from HTML 4 5 6... at

Links for 2008-12-03 [del.icio.us]

Secret Santa Need to do a Secret Santa? We did. HTML5 in Gecko Really exciting! Henri Sivonen: The effort of putting an HTML5 parser inside Gecko takes a step out of the vaporware land. Give me my tools and back off "A good UI can balance a...

Excerpt from techno.blog("Dion") at

Kaazing Founder Jonas Jacobi to speak on HTML 5, Comet, WebSockets, Future of the Web

Bangalore, December 10, 2008: Web applications have traditionally been seen as second-tier citizens in our network infrastructure, not capable of fully participating in the back-end message infrastructure due to their stateless architecture. The founder of Kaazing, Jonas Jacobi, is coming this summer to India’s biggest summit for the developer ecosystem - Great Indian Developer Summit ([link]) to speak on HTML 5 WebSockets, the one innovation, in particular, that will enable full-duplex HTTP communication, and finally bring an end to the tired ‘click and wait’ paradigm traditionally associated with the Web, and allow browsers to become first-class citizens in our network.

The author of ‘Pro JSF and Ajax: Building Rich Internet Components’ says that prior to the introduction of WebSockets, bi-directional browser communication has been an elusive beast. Attempts to address this gap in the Internet architecture have circled around server-initiated message delivery or “push” techniques, commonly known as Comet or ReverseAjax, and typically achieved with an astonishing assortment of browser hacks. But with the emerging standards outlined in the HTML 5 specification, developers can now take advantage of a full-duplex communications channel that operates over a single socket. More specifically, WebSockets enable browsers to open a socket connection to any TCP-based back-end service (for example, JMS, JMX, IMAP, Jabber, and so on), allowing you to easily create applications such as Web-based chat, and online trading, betting, and collaboration.

The 16-year software veteran, who previously worked at Brane and Oracle, will also speak on the future of the Web and Web technologies, address the importance of browser support of the HTML 5 standard, and offer insight into the key role developers play in HTML 5’s proliferation and the impact on end users.

About Great Indian Developer Summit

Great Indian Developer Summit, produced by Saltmarch Media ([link]), is the biggest gathering of software developers from Java/J2EE, Microsoft computing technologies, Rich Internet Applications (RIA), Web 2.0, Ajax, Agile, SOA, and Enterprise IT. For both veterans and newcomers to the world of .NET, Java, and the Rich Web, the Great Indian Developer Summit provides participants with a well-balanced learning experience that ensures they go back with a richer understanding of the technologies that make a difference to their careers. See the GIDS 2008 Red Stripe Report:
[link]

Over 3000 qualified and talented delegates attended GIDS 2008 - Source, The Hindu - Monday, 26 May 2008 ([link]). With outstanding educational sessions, powerhouse speakers, and a high-profile award ceremony, GIDS 2009 will feature premium knowledge, action plans and advice from been-there-done-it veterans, creators, and visionaries.

For further information on GIDS 2009, please visit the summit on the web [link].

A Saltmarch Media Press Release
E: info@developersummit.com
Ph: +91 080 4005 1000

Posted by Shaguf at
