It’s just data

Stripping Styles

Nick Bradbury: Most RSS aggregator developers (myself included) tackled this problem by completely removing all styles from feed content. Since then, I’ve experimented with stripping only “unsafe” CSS from feeds, and despite Adrian’s claim that doing so requires a lot of work, it’s actually quite easy to do.

First, here’s a use case.  Looks much better with style, doesn’t it?

Second, it would be helpful if aggregator authors could share their ideas (or at least point to them) from one place.  I suggest here.


Thanks for the pointer, Sam - I just created an account on that wiki.  And yes, that Wikipedia feed is far better with styles intact.

Posted by Nick Bradbury at

One type of feed consumer (client-side aggregators) can benefit from those styles easily because of their sandboxed rendering, but what about server-side aggregators, where style would or could disrupt the entire user experience? Are there any known tricks to sandbox css?

Posted by Elias Torres at

Elias: would you consider Planet Intertwingly a server side aggregator?  I seeded that wiki page with the sanitization rules that Venus implements.

Posted by Sam Ruby at

A set of unit tests would be incredibly useful for implementing those rules. There are already the html sanitisation tests as part of feedparser, but I believe they strip all style tags. I think this would be an interesting addition to my Eddie parser at some point.

The other big problem I have noticed is with people who blog about YouTube videos or other Flash content. Currently these get stripped from the content. I’m not entirely sure how to resolve this problem. Whitelisting isn’t entirely a scalable approach.

Posted by JD at

A set of unit tests would be incredibly useful for implementing those rules.

Here’s a few to start with:

Posted by Sam Ruby at

I’m not entirely sure how to resolve this problem. Whitelisting [flashlets] isn’t entirely a scalable approach.

(a bit tongue in cheek) /me has it solved in a very general way by running Flashblock and whitelisting selectively on the fly. It works quite well, and given how badly Flash handles power management under Linux, it saves quite a lot of battery.

Posted by Santiago Gala at

Response: On Stripping Styles for Security

Adrian Sutton blogs about the lack of CSS support in RSS aggregators, and concludes: "There has been a huge push in recent years to move away from the old habits of early HTML and to leverage CSS for presentation - the fact that it doesn’t...

Excerpt from Nick Bradbury at

There are already the html sanitisation tests as part of feedparser, but I believe they strip all style tags.

Nightly builds preserve a limited number of styles.  Associated test cases are [link] (scroll to the bottom for the “style_*.xml” files).

Posted by Mark at

Sam, PI would definitely count as a server-side aggregator.

Posted by Elias Torres at

Are there any known tricks to sandbox css?

IFrames?

Posted by Aristotle Pagaltzis at

I followed Bloglines' approach and permitted a whitelist of inline styles, then feed authors couldn’t use classes defined in an external style sheet.

What’s wrong with the following approach?

1 Hash the GUID.
2 Add the hash as a class attribute to each article appearing on a page.
3 Parse the external stylesheets for each article.
4 Prepend a class selector targeting the GUID hash class to each selector.
5 Concatenate the results to a master stylesheet, which is served with the page.
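
The five steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are made up for the example, the GUID hash is SHA-1 truncated to 12 hex digits, and the "parser" is a regular expression that only understands flat `selector { declarations }` rules (no at-rules, no braces inside comments or strings). A real implementation would use a proper CSS parser.

```python
import hashlib
import re


def guid_class(guid: str) -> str:
    """Steps 1-2: derive a class name from an entry GUID.

    The leading letter keeps the result a valid CSS class name even
    when the hex digest starts with a digit.
    """
    digest = hashlib.sha1(guid.encode("utf-8")).hexdigest()
    return "e" + digest[:12]


def scope_stylesheet(css: str, scope_class: str) -> str:
    """Steps 3-4: prepend `.scope_class ` to every selector.

    Assumption: the stylesheet consists of flat rules only.
    """
    scoped = []
    for selectors, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        prefixed = ", ".join(
            f".{scope_class} {selector.strip()}"
            for selector in selectors.split(",")
            if selector.strip()
        )
        scoped.append(f"{prefixed} {{ {body.strip()} }}")
    return "\n".join(scoped)


def build_master_stylesheet(entries):
    """Step 5: concatenate the scoped sheets of (guid, css) pairs."""
    return "\n".join(
        scope_stylesheet(css, guid_class(guid)) for guid, css in entries
    )
```

Note that the prefix is a descendant selector, so a rule in the feed's stylesheet only matches elements inside the article element carrying the hash class, not that element itself.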

Posted by Jim at

Are there any known tricks to sandbox css?

A few things to try:
1. only allow styling to be introduced via style attributes.

2. use a CSS parser to process the rules found in <style> elements (or external style sheets if you want), work out which elements they apply to, and add appropriate style attributes.  This basically converts the document into one that would pass (1).

3. put the content inside an element with an ID unused anywhere else (roughly similar to the process of picking a boundary string for MIME messages).  Process all style rules, rewriting selectors like “foo” to “#UNIQUE foo” so that they can only affect elements in that region of the page.

Of course, you’ll also want to limit what sort of style attributes are allowed as Sam says: you don’t want individual entries using absolute positioning or viewport relative positioning on a Planet-style site.
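
As a rough sketch of option 1 combined with the restriction just described, a style attribute can be filtered against a whitelist of property names. The whitelist and the list of forbidden value fragments below are assumptions made up for the example, not the rules Venus or any other aggregator actually applies; because `position` is simply absent from the whitelist, absolute and fixed positioning are dropped.

```python
# Hypothetical policy: property names permitted in style="" attributes.
ALLOWED_PROPERTIES = {
    "color", "background-color", "font-style", "font-weight",
    "text-align", "text-decoration",
}

# Hypothetical policy: a value containing any of these is rejected outright.
FORBIDDEN_VALUE_PARTS = ("url(", "expression(", "\\", "/*")


def sanitize_style(style: str) -> str:
    """Return the style attribute with only whitelisted declarations kept."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not sep or not value:
            continue  # malformed or empty declaration
        if name not in ALLOWED_PROPERTIES:
            continue  # e.g. position, top, left are not whitelisted
        if any(part in value.lower() for part in FORBIDDEN_VALUE_PARTS):
            continue  # external references, escapes, comments
        kept.append(f"{name}: {value}")
    return "; ".join(kept)
```

Splitting on `;` is naive (a semicolon inside a quoted string would break it), which is one more reason to treat this as a starting point rather than a finished sanitizer.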

Posted by James Henstridge at

you don’t want individual entries using absolute positioning or viewport relative positioning on a Planet-style site.

If you stick each post in an iframe of its own, you can let them all run wild – they will be unable to affect each other, even with absolute positioning.

Posted by Aristotle Pagaltzis at

Jim’s suggestion has the flaw that it doesn’t use CSS' selector power (instead of prepending all entries with the same hash, a unique ID could be set on a parent element and each entry could be selected with #unique-id .article). James' suggestion sounds all right, though.

I think what we need is a set of CSS sanitation rules. Parsing the CSS has become increasingly easy with the availability of CSS parsing libraries in different languages and then it all comes down to which styles to apply and which to ignore.

[...] you don’t want individual entries using absolute positioning or viewport relative positioning on a Planet-style site.

Absolute positioning is, imo, completely okay if it’s relative to a container within the entry. An easy way to compartmentalize this is to set .entry { position: relative; }. As explained in CSS 2.1: “If the element has ‘position: absolute’, the containing block is established by the nearest ancestor with a ‘position’ of ‘absolute’, ‘relative’ or ‘fixed’ [...]”.

If you stick each post in an iframe of its own, you can let them all run wild – they will be unable to affect each other, even with absolute positioning.

Iframes aren’t really a good solution. They’re an easy solution, but not that great. It’s difficult to set the correct height of an iframe according to the content it’s including, for instance. Especially if the user adjusts the font size of his/her browser to something other than the default.

Posted by Asbjørn Ulsberg at

Absolute positioning is, imo, completely okay if it’s relative to a container within the entry.

You can’t contain the absolutely positioned elements inside the element they are relative to. They can still be positioned to overlap other parts of the page.

It’s difficult to set the correct height of an iframe according to the content it’s including, for instance. Especially if the user adjusts the font size of his/her browser to something other than the default.

Only if you have a problem with JavaScript solutions. Otherwise you can easily use that to ask the embedded window for the height of its content. There are also hacks to detect font size changes from JavaScript, whereupon you just ask the embedded window for its new content height.

It’s not a great user experience, mind – things will shift and bob about and scrollbars will go wonky. But in terms of sandboxing CSS styles, the approach is 100% secure, very easy to implement, very easy to implement correctly, and doesn’t put any restriction on permissible styles.

Posted by Aristotle Pagaltzis at

And to think Sam considers me a complicator. Can anyone here point me to a single feed that uses an external stylesheet or style element? If not, it seems to me all this fantastic theorizing is a complete waste of time.

Posted by James Holderness at

You can’t contain the absolutely positioned elements inside the element they are relative to. They can still be positioned to overlap other parts of the page.

What’s wrong with position:relative; overflow:hidden; on the container? Or am I missing something?

Posted by Robin at

You can’t contain the absolutely positioned elements inside the element they are relative to. They can still be positioned to overlap other parts of the page.

With overflow: hidden; you’d get enough protection, no?

Only if you have a problem with JavaScript solutions. Otherwise you can easily use that to ask the embedded window for the height of its content. There are also hacks to detect font size changes from JavaScript, whereupon you just ask the embedded window for its new content height.

I feel that this is such a hacky solution resulting in a bad user experience coupled with even worse accessibility that it’s not the solution I’d pick at least.

But in terms of sandboxing CSS styles, the approach is 100% secure, very easy to implement, very easy to implement correctly, and doesn’t put any restriction on permissible styles.

Correct, but then the question must be asked: Is the goal to preserve all styles? If not, then an iframe may even complicate matters since it adds another layer of rendering to the solution and filtering out the styles you don’t want has to be done anyhow.

Adrian Sutton points out that what we really want might be a way to harmonize the styles across several entries (possibly from several different sources), especially if they are being read as “a river of news” (all presented at once, as opposed to one at a time). The important thing is that there isn’t any data loss when styles are stripped, not that all styles stay intact. At the same time, there should be consistency across all entries.

PS: I got the following error while trying to post this comment:

CGI Failure
traceback:Traceback (most recent call last):
  File "gateway.cgi", line 47, in ?
    identity.validate(dict(cgi.parse_qsl(os.environ['QUERY_STRING'])))
  File "/home/rubys/mombo/identity.py", line 55, in validate
    file = writeComment(session['parent'],title,body)
  File "/home/rubys/mombo/post.py", line 240, in writeComment
    raise Exception(message)
Exception: POST limit exceeded

Posted by Asbjørn Ulsberg at

I got the following error while trying to post this comment

Got your email, and was about to email you back, but it seems like the condition has cleared up.  Yes, POST limit exceeded is meant to indicate a rate throttle condition; as spammers vary URIs and IP addresses, the conditions checked are complicated to describe, but I see nothing that would have triggered this.  Also, I do the check twice: once before proceeding, at which point I can produce a reasonable warning/error, and once right before the message is committed, in case the situation has changed or the code got to the write via another codepath.  To have gotten this message you would have had to pass the first check and fail the second.  Twice, as I see you got the error when you tried to post again.
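
The check-twice pattern can be illustrated with a minimal sketch. Everything here is hypothetical: the class and function names, the single key per poster, and the simple sliding-window count stand in for the real conditions, which are described above as complicated and are not reproduced.

```python
import time


class PostThrottle:
    """Sliding-window limit on comments per key (hypothetical policy)."""

    def __init__(self, limit=3, window=60.0):
        self.limit = limit
        self.window = window
        self.history = {}  # key -> timestamps of recent posts

    def allowed(self, key, now=None):
        now = time.time() if now is None else now
        recent = [t for t in self.history.get(key, []) if now - t < self.window]
        self.history[key] = recent
        return len(recent) < self.limit

    def record(self, key, now=None):
        now = time.time() if now is None else now
        self.history.setdefault(key, []).append(now)


def submit_comment(throttle, key, body, store, now=None):
    # First check: early enough to give the poster a readable warning.
    if not throttle.allowed(key, now):
        return "warning: too many comments, please wait"

    # ... identity validation, preview and so on would happen here ...

    # Second check: right before the commit, in case the situation changed
    # in the meantime or this code was reached via another path.
    if not throttle.allowed(key, now):
        raise Exception("POST limit exceeded")

    store.append(body)
    throttle.record(key, now)
    return "ok"
```

In this single-threaded sketch the second check can never disagree with the first; it only matters when state can change between the two, which is the situation the paragraph above describes.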

I will continue to investigate...

Posted by Sam Ruby at

Yes, I got the exception twice. And now as I’m posting this, I get the red painted “Placing multiple comments...” warning that I usually see when posting several successive comments to your blog. I’ll trust “If this does not apply to you, please feel free to post your comment”, post, and see how this goes.

Could you perhaps elaborate on the algorithm you use for this, what kind of techniques you employ, etc? It probably deserves its own post, though, but I would love to know the inner workings of your spam protection scheme.

Posted by Asbjørn Ulsberg at

I’ve posted on this multiple times, and the code is here.  Read the comments inside spamrank for details.

Posted by Sam Ruby at

Since this thread seems to have evolved into error reports, I should point out that I also got an error on my last post:

CGI Failure

traceback:Traceback (most recent call last):
  File "gateway.cgi", line 47, in ?
    identity.validate(dict(cgi.parse_qsl(os.environ['QUERY_STRING'])))
  File "/home/rubys/mombo/identity.py", line 57, in validate
    print 'Status: 302 Found\r\n%s\r' % str(cookie(session['url']))
  File "/home/rubys/mombo/identity.py", line 80, in cookie
    cookie['openid'] = id
  File "/var/tmp/python2.4-2.4-root/usr/lib/python2.4/Cookie.py", line 580, in __setitem__
    rval, cval = self.value_encode(value)
  File "/var/tmp/python2.4-2.4-root/usr/lib/python2.4/Cookie.py", line 667, in value_encode
    strval = str(val)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-13: ordinal not in range(128)

The post still succeeded though - i.e. when I manually went back to the page, the comment had shown up.

Posted by James Holderness at

James: should be fixed now.  Python 2.x’s handling of Unicode is a PITA.

Posted by Sam Ruby at

Sanitizing CSS: 10 Tips for Aggregator Developers

Earlier this week I wrote about sanitizing CSS , and I’ve been thinking about it a bit more. Like many RSS aggregators, for security and presentation reasons the current version of FeedDemon strips all inline styles before displaying a feed, and I...

Excerpt from Nick Bradbury at

Jim’s suggestion has the flaw that it doesn’t use CSS' selector power (instead of prepending all entries with the same hash, a unique ID could be set on a parent element and each entry could be selected with #unique-id .article).

Unless I’m misunderstanding you, that doesn’t work for the “river of news” scenario where entries from different feeds are interspersed.

Can anyone here point me to a single feed that uses an external stylesheet or style element? If not, it seems to me all this fantastic theorizing is a complete waste of time.

That argument applies to anything that is currently unimplemented.  Nobody wants to use something that doesn’t currently work.  Hardly anybody uses Python 3 yet, but that doesn’t mean that the work being done on it is “a complete waste of time”.

Posted by Jim at

I wrote a comment right after Robin to say that I was under the mistaken impression that overflow:hidden would not work for absolutely positioned elements – but “POST limit exceeded” ate the comment and I didn’t resubmit because I thought I’d cause a dupe. Oh well.

Posted by Aristotle Pagaltzis at

This post has made the front page of Google results for... stripping.

Posted by Sam Ruby at

There is very little difference between aggregating feeds and accepting arbitrary HTML input; only the input model differs: traditionally users submit data via textareas, whereas aggregators go to the website and pull the data. Perhaps the fundamental difference is that in the second case, there are multiple representations of the data, one of which the content author handles himself. Content authors know what they want, and they won’t necessarily want to help make the aggregator’s life easier.

I agree with Nick Bradbury’s assessment. The content producer must use a common language, usually semantic HTML, to ensure their message gets across. As an aggregator, we should recognize that plain text is not enough, and allow at a minimum some subset of useful HTML.

If we do decide to allow CSS, questions like these may be posed: is font-size:99pt acceptable? And then, what if it is absolutely necessary to the meaning of the content? Is a background-image pointing to an external website acceptable? What if it is absolutely necessary to the meaning of the content? What if it poses a privacy risk to the readers of the feed?

It can be done (my pet implementation is HTML Purifier, which treats each CSS property separately and doesn’t allow nonsensical definitions like border-width:gray;), but current implementations like Bloglines' are very naive, probably due to developer laziness and performance. Perhaps we’re asking too much of aggregators: CSS makes no distinction between semantic styling (is there such a beast?) and presentational styling and useless styling.
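
To make the "each property separately" idea concrete, here is a small sketch in Python. It is not HTML Purifier's API (that library is written in PHP); the validator table, the handful of named colors, and the single-value restriction are all assumptions chosen for the example.

```python
import re

# Assumed grammar fragments; real CSS allows more units and more forms.
LENGTH = re.compile(r"^(0|\d+(\.\d+)?(px|em|ex|pt|%))$")
HEX_COLOR = re.compile(r"^#([0-9a-f]{3}|[0-9a-f]{6})$")
NAMED_COLORS = {"black", "white", "gray", "red", "green", "blue"}
BORDER_WIDTH_KEYWORDS = {"thin", "medium", "thick"}


def valid_length(value):
    return bool(LENGTH.match(value))


def valid_color(value):
    return value in NAMED_COLORS or bool(HEX_COLOR.match(value))


def valid_border_width(value):
    # Single value only; the four-value shorthand is left out of the sketch.
    return value in BORDER_WIDTH_KEYWORDS or valid_length(value)


# One validator per property: a value is judged against the grammar of
# the property it is assigned to, not against CSS as a whole.
VALIDATORS = {
    "color": valid_color,
    "border-width": valid_border_width,
    "font-size": valid_length,
}


def validate_declaration(name, value):
    """Accept a declaration only if its property is known and its value fits."""
    validator = VALIDATORS.get(name.strip().lower())
    return validator is not None and validator(value.strip().lower())
```

With this table, `border-width: gray` is rejected because `gray` is neither a width keyword nor a length, while `font-size: 99pt` passes: it is well-formed, and whether it is acceptable remains a policy question rather than a syntactic one.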

On the subject of sandboxing external styles, you’ll need to compiler-style mangle the class and id names, because they may interfere with styles defined by the containing aggregator presentation page. It would be very neat functionality though. Fortunately, CSS selectors aren’t so expressive as to be Turing Complete, but I would play it safe and parse them anyway and ensure they are not using browser hacks or anything of that nature.
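
A very small sketch of that kind of name mangling, assuming a per-feed prefix and operating on selector text only (declarations must not be passed through it, and attribute selectors containing dots or hashes would be mangled incorrectly). The function names and the prefix are hypothetical.

```python
import re

# A class (.name) or id (#name) token inside a selector.
CLASS_OR_ID = re.compile(r"([.#])(-?[A-Za-z_][\w-]*)")


def mangle_selector(selector: str, prefix: str) -> str:
    """Prefix every class and id name in a selector."""
    return CLASS_OR_ID.sub(
        lambda match: f"{match.group(1)}{prefix}{match.group(2)}", selector
    )


def mangle_class_attribute(value: str, prefix: str) -> str:
    """Apply the same prefix to the names in an HTML class attribute."""
    return " ".join(prefix + name for name in value.split())
```

The same prefix has to be applied to the feed's markup and to its stylesheet, so that the mangled rules still match the feed's own elements but can no longer collide with names used by the aggregator's page.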

As for position:absolute, who the hell uses position:absolute in their blog content? For that matter, who the hell uses position:absolute in any of their content? (I must admit, I have seen some very funky styling for Wikipedia articles, but position:absolute is almost always reserved for meta-data.)

A link to a list of all CSS properties and assessments of their safety can be found here: [link]

Posted by Edward Z. Yang at

Welcome back, Edward.  The link you provided has been added to the wiki page I created.  Better get used to it; that’s how I keep track of things, by putting them all in one place.

I don’t think any of us are lazy here, just hopelessly optimistic and perhaps a bit naïve.  In any case, I plan to periodically resync the feed validator’s lists from that page so it will warn people if they are using style properties or values that are likely to be stripped; that’s why I am encouraging people who are working on purifiers to work together.  I see you have updated the pages once; feel free to improve on them as you see fit.

Posted by Sam Ruby at

I get the red painted “Placing multiple comments...” warning that I usually see when posting several successive comments to your blog

Registered users (even users that check none of the notification checkboxes) will now see significant relief from this.

Posted by Sam Ruby at
