It’s just data

HTML5 Sanitizer

A while back, I commented that I would likely backport Jacques’s sanitizer to Python.  I still haven’t gotten around to that, but I have ported it to html5lib (source, tests).

My approach was slightly different.  I made it a subclass of HTMLTokenizer, meaning that the parsing and sanitization is all done in one pass, with the results sent to the treebuilder of your choice.

Example usage:

require 'html5lib/sanitizer'
require 'html5lib/html5parser'
include HTML5lib
HTMLParser.parse(stream, :tokenizer => HTMLSanitizer).to_s

Other differences worth noting:

As the differences were more than I had anticipated, doing this work in two steps, and verifying the results via unit tests on each step, will make the overall effort easier.

Update: Python version is now available: source, tests


Ah! Cool. I was thinking that, perhaps, I should do something like this.

Jacques’s library assumes iso-8859-1

Maybe some of the tests, which I cribbed from elsewhere, do. The code itself was designed to work in Instiki/Rails, which is a utf-8 environment.

HTMLParser preemptively downcases element and attribute names.  XHTMLParser does not.  Pick your poison.

If you’re allowing SVG elements and attributes, then your choice of poison is dictated for you.

Now to roll this back into Instiki ...

Posted by Jacques Distler at

The code itself was designed to work in Instiki/Rails, which is a utf-8 environment.

Look how I modified the assignment to val_unescaped

If you’re allowing SVG elements and attributes, then your choice of poison is dictated for you.

On the XHTMLParser TODO list is to lowercase elements and attributes, but only in the xhtml namespace.

Now to roll this back into Instiki

Hopefully shortly, all this will be available as a gem.

Posted by Sam Ruby at

Look how I modified the assignment to val_unescaped

I don’t think either of us did it right.

Non-breaking space (U+00A0) is, as you say, encoded as \302\240 in utf-8. But, then, the rest of the characters in the range U+0080 – U+009F are also encoded as \302\200 – \302\237 in utf-8.

Curse Ruby’s lack of Unicode support!

In any case, maybe I was being too lenient, allowing hT\302\240tP:// to match an allowed protocol. I should be just as happy dropping protocol names which include embedded control characters, tabs, non-breaking spaces, and the like.

So one could just as well drop that gsub() and get a slightly less lenient, but encoding-independent sanitizer.

Posted by Jacques Distler at

So one could just as well drop that gsub() and get a slightly less lenient, but encoding-independent sanitizer.

Unfortunately, the result would be more lenient.  " javascript:alert('foo')" would be allowed.

As this particular implementation derives from HTML5lib::HTMLTokenizer, it is safe to presume utf-8.  I’ve adjusted the range accordingly.  Thanks!
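For anyone following along, the check in question has roughly this shape (paraphrased, not the exact code from either library; the protocol whitelist here is abbreviated):

require 'cgi'

ALLOWED_PROTOCOLS = %w[http https mailto ftp]

# Paraphrased sketch of the protocol check.  Strip ASCII controls, whitespace,
# DEL and U+0080..U+00A0 before matching; the Ruby 1.8-era code matches the raw
# UTF-8 bytes \302\200-\302\237 plus \302\240 instead, which is the range
# discussed above.
def protocol_allowed?(value)
  val = CGI.unescapeHTML(value.to_s).gsub(/[\u0000-\u0020\u007F\s\u0080-\u00A0]+/, '').downcase
  # Only values that look like "scheme:..." are checked against the whitelist;
  # anything else is treated as a relative URI and allowed through (which is
  # why dropping the stripping entirely would let " javascript:..." slip by).
  return true unless val =~ /\A[a-z0-9][-+.a-z0-9]*:/
  ALLOWED_PROTOCOLS.include?(val.split(':').first)
end

protocol_allowed?("hT\302\240tP://example.com/")  # => true  (nbsp stripped, scheme is http)
protocol_allowed?(" javascript:alert('foo')")     # => false (space stripped, scheme rejected)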

Posted by Sam Ruby at

Neat!

I started something similar that worked with the final tree but the freedom to choose a treebuilder that comes with altering the tokens is a very big win. Any plans to backport this to python or do I need to add it to my todo list?

Posted by jgraham at

[from ludoo] Sam Ruby: HTML5 Sanitizer

A while back, I commented that I would likely backport Jacques’s sanitizer to Python. I still haven’t gotten around to that, but I have ported it to html5lib (source, tests)....

Excerpt from del.icio.us/network/becks77 at

Any plans to backport this to python

Done.  Noticed a small porting error in Jacques’s version that I corrected.  In Python “” is false; in Ruby it is true.  This difference led to an empty background-image: appearing in the platypus output... I changed it to be consistent with the Python version, which deletes this property entirely.
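Illustrative only (with made-up names, not the sanitizer’s actual code), but the difference boils down to this:

value = ""
clean = []
clean << "background-image: #{value}" if value            # Ruby: "" is truthy, empty property kept
clean << "background-image: #{value}" unless value.empty? # matches Python, where "" is falsy
p clean  # => ["background-image: "]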

From this point forward, it should not be hard to keep the test cases in sync.

Posted by Sam Ruby at

(Writing from a cyber cafe in Kenya...)

Sweet!  I should probably see about having the other people who are now working on FeedTools use this to replace the existing sanitization code, which really needs work.

Posted by Bob Aman at

Shouldn’t

require 'html5lib/sanitizer'
require 'html5lib/liberalxmlparser'
include HTML5lib

XHTMLParser.parseFragment(stream, :tokenizer => HTMLSanitizer).to_s

treat attribute names in a case-sensitive manner?

Posted by Jacques Distler at

With the code you have, this works:

puts HTML5lib::XHTMLParser.new.parse('<html><FOO/></html').to_s

If you pull the latest, this will now also work too:

puts HTML5lib::XHTMLParser.parse('<html><FOO/></html').to_s

(this also applies to the parseFragment methods)

Posted by Sam Ruby at

Cool! Thanks.

In return, tests for case-sensitive elements and attributes.

(Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)

Just a side-note to anyone wishing to use this sanitizer: the HTML5 parser is quite unfriendly to MathML named entities. So you have to convert them to NCR’s before calling the sanitizer.
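Something along these lines works for the conversion (a sketch; MATHML_ENTITIES here is a made-up, abbreviated table, and a real one needs the full MathML entity set):

# Made-up, abbreviated entity table; a real one needs the full MathML set.
MATHML_ENTITIES = { 'InvisibleTimes' => 0x2062, 'af' => 0x2061 }

def to_ncr(text)
  text.gsub(/&([a-zA-Z][a-zA-Z0-9]*);/) do
    cp = MATHML_ENTITIES[$1]
    cp ? "&#x#{cp.to_s(16)};" : $&   # convert known names, leave everything else alone
  end
end

to_ncr('a&InvisibleTimes;b')  # => "a&#x2062;b"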

Posted by Jacques Distler at

In return, tests for case-sensitive elements and attributes.

Committed.  If you are interested, I will add you as a committer.

(Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)

test case?

Just a side-note to anyone wishing to use this sanitizer: the HTML5 parser is quite unfriendly to MathML named entities. So you have to convert them to NCR’s before calling the sanitizer.

It seems to me that if HTMLTokenizer::initialize simply set @entities = ENTITIES, and both provided an appropriate accessor and made use of it,  XHTMLParser could set @tokenizer.entities = HTML5lib::ENTITIES.merge(MATHML_ENTITIES) or some such.
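Roughly, the suggestion amounts to this (a sketch; the constants are small stand-ins, not the actual html5lib code):

ENTITIES        = { 'amp' => '&', 'lt' => '<' }       # stand-in for HTML5lib::ENTITIES
MATHML_ENTITIES = { 'InvisibleTimes' => "\u2062" }    # hypothetical extra table

class HTMLTokenizer
  attr_accessor :entities        # expose the table instead of hard-coding the constant
  def initialize
    @entities = ENTITIES         # default; callers may replace or merge it
  end
end

tokenizer = HTMLTokenizer.new
tokenizer.entities = ENTITIES.merge(MATHML_ENTITIES)  # what XHTMLParser could do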

Just a thought.

Posted by Sam Ruby at

If you are interested, I will add you as a committer.

Works for me either way. (The good thing about BZR, as I learned from you, is that no such decision is necessary.)

(Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)

test case?

I believe the tests should read:

--- HTML5lib/tests/test_sanitizer.rb.orig 2007-05-26 09:31:16.000000000 -0500
+++ HTML5lib/tests/test_sanitizer.rb     2007-05-26 09:43:28.000000000 -0500
@@ -21,11 +21,15 @@
     next if %w[caption col colgroup optgroup option table tbody td tfoot th thead tr].include?(tag_name) ### TODO
     define_method "test_should_allow_#{tag_name}_tag" do
       if tag_name == 'image'
-        assert_equal "<img title=\"1\"/>foo &lt;bad&gt;bar&lt;/bad&gt; baz",
+        assert_equal "<img title=\"1\">foo &lt;bad&gt;bar&lt;/bad&gt; baz",
           sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
+        assert_equal "<img title=\"1\"/>foo &lt;bad&gt;bar&lt;/bad&gt; baz",
+          sanitize_xhtml("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
       elsif VOID_ELEMENTS.include?(tag_name)
-        assert_equal "<#{tag_name} title=\"1\"/>foo &lt;bad&gt;bar&lt;/bad&gt; baz",
+        assert_equal "<#{tag_name} title=\"1\">foo &lt;bad&gt;bar&lt;/bad&gt; baz",
           sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
+        assert_equal "<#{tag_name} title=\"1\"/>foo &lt;bad&gt;bar&lt;/bad&gt; baz",
+          sanitize_xhtml("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
       else
         assert_equal "<#{tag_name.downcase} title=\"1\">foo &lt;bad&gt;bar&lt;/bad&gt; baz</#{tag_name.downcase}>",
           sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")

It seems to me that if HTMLTokenizer::initialize simply set @entities = ENTITIES, and both provided an appropriate accessor and made use of it,  XHTMLParser could set @tokenizer.entities = HTML5lib::ENTITIES.merge(MATHML_ENTITIES) or some such.

That might be useful to some people. For me, it was just as easy to change

sanitize_xhtml(text).to_ncr

to

sanitize_xhtml(text.to_ncr)

.

Posted by Jacques Distler at

-        assert_equal "<img title=\"1\"/>foo &lt;bad&gt;bar&lt;/bad&gt; baz",
+        assert_equal "<img title=\"1\">foo &lt;bad&gt;bar&lt;/bad&gt; baz",

Let’s be clear about what is going on here.  The default treebuilder is (currently) rexml, so what is happening here is that a REXML::Document is being produced, and then to_s is being called on that document.
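Concretely, something like this (a sketch; as noted, rexml is the default treebuilder, so to_s is REXML’s serialization):

require 'html5lib/html5parser'

doc = HTML5lib::HTMLParser.parse("<img title='1'>")  # a REXML::Document with the default treebuilder
puts doc.to_s  # REXML gives the empty img element the trailing solidus under discussion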

Posted by Sam Ruby at

Understood.

All I’m saying is that REXML is, perhaps, not the best serializer for producing HTML5 (as opposed to XHTML5) output. It seems that the Python version of HTML5lib has an HTMLSerializer class, but not – currently – the Ruby version.

This is, I’ll agree, not an issue for the Sanitizer itself.  But it’s still an issue.

Posted by Jacques Distler at

It seems that the Python version of HTML5lib has an HTMLSerializer class, but not – currently – the Ruby version.

That’s new.  Cool.  Must port.

Posted by Sam Ruby at

HTML5 Sanitizer - ruby and python code

[link]...

Excerpt from del.icio.us/jemisa at

It seems that the Python version of HTML5lib has an HTMLSerializer class, but not – currently – the Ruby version.

That’s new.  Cool.  Must port.

Another thing about the Serializer. It should probably treat HTML5 VOID elements (outputting “<br>” in the HTML5 and “<br />” (note the space) in the XHTML5 serialization) differently from empty elements in other namespaces (outputting them as, e.g. "<none></none>"). That’s the way the SVGfix MovableType plugin works.
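That is, roughly (an illustrative sketch, not the html5lib serializer; the void element list is abbreviated):

VOID_ELEMENTS = %w[area base br col embed hr img input link meta param]  # abbreviated

def empty_tag(name, xhtml = false)
  if VOID_ELEMENTS.include?(name)
    xhtml ? "<#{name} />" : "<#{name}>"   # void: no end tag; trailing solidus only in XHTML5
  else
    "<#{name}></#{name}>"                 # empty but not void: keep the explicit end tag
  end
end

empty_tag('br')           # => "<br>"
empty_tag('br', true)     # => "<br />"
empty_tag('none', true)   # => "<none></none>"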

Posted by Jacques Distler at

“<br />” (note the space) in the XHTML5 serialization

Just curious, but why?  I’ve seen this convention before, but haven’t been able to find a reason behind it.  Oddly the W3C validator seems to accept “<br/>” (no space) as valid HTML4.  It also is valid HTML5.

Posted by Sam Ruby at

Just curious, but why?

That’s what Appendix C says.

Oddly the W3C validator seems to accept “<br/>” (no space) as valid HTML4.

Just because it’s “valid” (when checked by an SGML parser) doesn’t mean it’s interpreted correctly in legacy HTML UA’s.

I don’t know which (if any) current HTML UAs would stumble over “<br/>”. Perhaps this bit of Appendix C is now obsolete.

Posted by Jacques Distler at

Just a note about the HTMLSerializer in html5lib: you should theoretically be able to get an XHTML5 serialization by setting quote_attr_values=True, minimize_boolean_attributes=False, use_trailing_solidus=True, and omit_optional_tags=False.

The default options should produce a very compact HTML5 serialization, with unquoted attribute values when possible, minimized boolean attributes, omitted optional tags, and void elements without a trailing solidus. I was wondering whether a whitespace stripper could be added to remove “unnecessary” whitespace (whitespace between <table>, <tr> and <td>, collapsing multiple whitespace characters into a single space, etc.). This would need to be disabled by default to avoid conflicts with any white-space: preserve in CSS.

Posted by Thomas Broyer at

Thomas: you’ve been busy!

I’ve been waiting to port it while the code was in flux.  Is the code either (a) near stable at this point, or (b) if there were a functioning Ruby equivalent, would you be willing to help maintain both versions?

Posted by Sam Ruby at

As an irrelevant aside, every time I see the title “HTML5 Sanitizer” (which is often, because I’m subscribed to the comments feed), my brain wants to read it “Hand Sanitizer” for some reason.

Posted by Aristotle Pagaltzis at

Sam, there’s one thing I’d like to change before you can start the port to Ruby: changing TreeWalker to an iterable object (i.e. for token in TreeWalker(tree) instead of the current for token in TreeWalker().walk(tree) pattern).

Apart from the above change, the tree walkers API is near stable (it’s just a matter of generating the same tokens as those from the HTMLTokenizer). I’d welcome any suggestions regarding the tree walker implementations, though ;-)

I’ve just changed the HTMLSerializer options and added a few new features (injecting a <meta charset="..."> in the <head> –not yet plugged-in–, stripping “unnecessary” whitespace –plugged-in but a no-op at the moment–, choosing the “best” quote char for each attribute value –done, needs tests–) but the implementation is now quite stable (optional tags are a port from genshihtml5 with 100% code coverage).

Wrt the Ruby port, I have no knowledge of Ruby at all (and don’t really want to learn it for now), so I’m afraid I’ll be unable to help.

Posted by Thomas Broyer at

Sam, there’s one thing I’d like to change before you can start the port to Ruby: changing TreeWalker to an iterable object (i.e. for token in TreeWalker(tree) instead of the current for token in TreeWalker().walk(tree) pattern).

OK... I committed that change... let me know what you think...

Posted by Sam Ruby at

OK... I committed that change... let me know what you think...

Heck, you missed test_treewalkers.py ;-)

Also, I’m not sure about it but how about making __iter__ = walk instead of

def __iter__(self):
    for token in self.walk():
        yield token

It would avoid creating a new no-op generator. Would there be any drawback?

Anyway, the “<meta charset>” injector is now plugged in and activated as soon as you give an encoding to serialize and the inject_meta_charset option is true (the default value). It just needs unit tests ;-)

Posted by Thomas Broyer at

Heck, you missed test_treewalkers.py ;-)

Oops!  I must have done the commit from the src directory as I had verified my changes.  Thanks!

Also, I’m not sure about it but how about making __iter__ = walk

Done.  Not that it matters much, but I actually did it the other way around as I gather that __iter__ is the preferred entry point.

Posted by Sam Ruby at

Hmmm. I haven’t tried the Python version, but the Ruby version definitely has problems with astral-plane Unicode characters.

Test:

def test_should_handle_astral_plane_characters
  assert_equal "<p>\360\235\222\265 \360\235\224\270</p>",
    sanitize_xhtml("<p>&#x1d4b5; &#x1d538;</p>")
end

I’ve gone back to my previous sanitizer. :-(

Posted by Jacques Distler at

Jacques, re your comment from Mon 28 May 2007, the serializer’s input is a list of “tokens” (the same that would be produced by the tokenizer, as defined in the HTML5 spec) so there’s no notion of namespace. Wrt empty elements, the serializer uses the void elements list and does not look at the element’s content at all, so a <none> element would always be output as <none></none>.

To serialize a tree, you pass the serializer a treewalker. Currently, treewalker implementations for namespace-aware trees (DOM, PullDOM and ElementTree) expect the elements to be in the null namespace. This could be changed; we need unit tests for these kinds of things to catch potential bugs (I’m sure we’d have weird results with such trees). Actually, unit tests for treewalkers need refactoring: we should write specific tests rather than reusing the parser tests.

/me: we should make the HTMLSanitizer a mixin or a filter so that it could be used between a treewalker and the serializer (or between a treewalker and the parser, to build a sanitized tree from an existing tree)

Posted by Thomas Broyer at

the Ruby version definitely has problems with astral-plane Unicode characters.

Fixed.  Thanks!

we should make the HTMLSanitizer a mixin or a filter so that it could be used between a treewalker and the serializer (or between a treewalker and the parser, to build a sanitized tree from an existing tree)

Something like this should work:

Index: src/sanitizer.py
===================================================================
--- src/sanitizer.py    (revision 639)
+++ src/sanitizer.py    (working copy)
@@ -2,7 +2,7 @@
 from xml.sax.saxutils import escape, unescape
 from tokenizer import HTMLTokenizer
 
-class HTMLSanitizer(HTMLTokenizer):
+class HTMLSanitizer(object):
     """ sanitization of XHTML+MathML+SVG and of inline style attributes."""
 
     acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b',
@@ -119,6 +119,19 @@
     allowed_svg_properties = acceptable_svg_properties
     allowed_protocols = acceptable_protocols
 
+    def __init__(self, stream, encoding=None, parseMeta=True):
+        if not hasattr(stream, '__iter__') or encoding or not parseMeta:
+            self.source = HTMLTokenizer(stream, encoding, parseMeta)
+        else:
+            self.source = stream
+
+    def getstream(self):
+        try:
+            return self.source.stream
+        except AttributeError:
+            return None
+    stream = property(getstream)
+
     # Sanitize the +html+, escaping all elements not in ALLOWED_ELEMENTS, and
     # stripping out all # attributes not in ALLOWED_ATTRIBUTES. Style
     # attributes are parsed, and a restricted set, # specified by
@@ -131,7 +144,7 @@
     #   sanitize_html('<a href="javascript: sucker();">Click here for $100</a>')
     #    => <a>Click here for $100</a>
     def __iter__(self):
-        for token in HTMLTokenizer.__iter__(self):
+        for token in self.source:
             if token["type"] in ["StartTag", "EndTag", "EmptyTag"]:
                 if token["name"] in self.allowed_elements:
                     if token.has_key("data"):

Needs tests.

Posted by Sam Ruby at

Hello World

My blog is full of bombs and or keys. It has been a very entertaining and busy two weeks. In screencasts DHH shows us how easy it is to create a web log using the rails framework. In fact, it only takes 15 minutes. If only. Building my blog took...

Excerpt from Sam's Spot at

