It’s just data

Blogging with Style

Joe Friend: For example we are encoding smart quotes incorrectly so I had to turn off that feature in Word, but the goal is to output just what is needed to make your blog post clean and readable (code and rendered HTML).

Cool!  There’s hope yet.  ;-)

On a somewhat related note, I’m investigating to see if there is a simple set of checks which could be made to enable style attributes to pass safely through feeds.  Previous recommendations were to strip all style attributes.

My first pass at this came up with the following regular expression:

/^[-:,;# a-zA-Z0-9]*$/

Pros: it is simple to implement; it doesn’t even require regular expressions.  And despite its simplicity, it seems like it would keep out the worst of the vermin (which seem to require parens).

Cons: one can still do some mischief with things like position:absolute.  To address that does require a bit deeper parsing, but not too bad.  Looking at what exists in style attribute values on the web today, the majority is very simple.  I don’t even see quotes in use.  Anything more difficult to parse should be stripped.
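To illustrate the point that no regular expression engine is needed, here is a minimal sketch of the same character check in Python. The names ALLOWED_CHARS and is_simple_style are invented for this example and are not part of any existing parser.

```python
import string

# The same character set as /^[-:,;# a-zA-Z0-9]*$/
ALLOWED_CHARS = set(string.ascii_letters + string.digits + "-:,;# ")

def is_simple_style(value):
    """Return True if every character of the style value is in the allowed set."""
    return all(ch in ALLOWED_CHARS for ch in value)

# A plain declaration passes
print(is_simple_style("TEXT-DECORATION: line-through"))
# Anything needing parens or quotes is rejected
print(is_simple_style("any: expression(window.location='http://example.org/')"))
# The mischief mentioned under Cons still gets through
print(is_simple_style("position:absolute"))
```

The last call returns True, which is why a character check alone is not enough to deal with position:absolute.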

The goal is to enable people who want to use Rich Text Editors like the ones that are found in recent versions of IE and Firefox.  And, perhaps someday, one could even use a suitably housebroken version of Word. ;-)

If we can come up with a profile for safe style attribute usage that is relatively easy to parse, I can work to get this into the Feed Validator and the Universal Feed Parser.


What about setting the color of text? That would cause problems on sites that use a similar color as background color.

Posted by Sjoerd Visscher at

At the moment, the Universal Feed Parser does not strip <font color="#FFFFFF">, which would have similar issues.

If you look at Joe’s feed at the moment, you will see:

&lt;SPAN style="TEXT-DECORATION: line-through"&gt;strikethrough&lt;/SPAN&gt;

There is no question that they should be using the del tag instead here, but leaving that aside for the moment, stripping style attributes in contexts such as these changes the meaning of what the reader ultimately sees.

That’s what I am looking to address.

Posted by Sam Ruby at

<rss>
<item>
<description>&lt;span style="&amp;#97;&amp;#110;&amp;#121;&amp;#58;&amp;#32;&amp;#101;&amp;#120;&amp;#112;&amp;#114;&amp;#101;&amp;#115;&amp;#115;&amp;#105;&amp;#111;&amp;#110;&amp;#40;&amp;#119;&amp;#105;&amp;#110;&amp;#100;&amp;#111;&amp;#119;&amp;#46;&amp;#108;&amp;#111;&amp;#99;&amp;#97;&amp;#116;&amp;#105;&amp;#111;&amp;#110;&amp;#61;&amp;#39;&amp;#104;&amp;#116;&amp;#116;&amp;#112;&amp;#58;&amp;#47;&amp;#47;&amp;#101;&amp;#120;&amp;#97;&amp;#109;&amp;#112;&amp;#108;&amp;#101;&amp;#46;&amp;#111;&amp;#114;&amp;#103;&amp;#47;&amp;#39;&amp;#41;"&gt;&lt;/span&gt;</description>
</item>
</rss>

Works in IE.

Give up yet?

Posted by Mark at

Oh never mind, you’re stripping ampersands.

Posted by Mark at

Unobfuscated:

<span style="any: expression(window.location='http://example.org/')"></span>

Cute.

As you point out, such a style attribute would be stripped entirely as my premise is that all the common usages I have seen are easy to parse; so outright rejecting anything beyond the most basic vocabulary would not cause any grief.
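For anyone who wants to reproduce the unobfuscation, a small sketch using Python's standard html module follows. It assumes the value is decoded twice, once for the XML layer and once for the HTML layer; the variable names are illustrative only.

```python
import html

payload = "any: expression(window.location='http://example.org/')"

# Rebuild the doubly-escaped form carried inside the <description> element
escaped = "".join("&amp;#%d;" % ord(ch) for ch in payload)

once = html.unescape(escaped)   # first decode: numeric character references remain
twice = html.unescape(once)     # second decode: the style value a browser would see

print(once[:11])         # &#97;&#110;
print(twice == payload)  # True
```

The once-decoded form still contains ampersands and the fully decoded form contains parens, so a check that permits neither rejects the value at either stage.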

Posted by Sam Ruby at

The more depressing thing is that IE parses it at all.  As in, that they parse JavaScript in CSS.  Then again, Firefox has -moz-binding to worry about, which (IIRC) was the subject of a recent high-profile exploit (I don’t want to speculate where because I may be misremembering the details).

Posted by Mark at

Today's links [May 12, 2006]

Sam Ruby: Blogging with Style Looking for “a simple set of checks which could be made to enable style attributes to pass safely through feeds” Feed Manager for Movable Type "a plugin that provides turn-key comment feeds for your Movable Type...

Excerpt from Blogging Roller at

Counter-examples found in the wild:

FONT-FAMILY: 'Lucida Console'
color: rgb(0, 0, 128)
tab-stops: list .5in
font-size:85%

Revised regular expression:

^([-:,;#%.\sa-zA-Z0-9]|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$

Here’s a more complete survey of properties found.  Most notable is the existence of position:absolute in Ray Ozzie’s feed.

Posted by Sam Ruby at

All right, you’ve convinced me that you can detect potentially malicious CSS, albeit with some false positives.  This still doesn’t solve the problem of individual styles that would destroy (or seriously impair) the layout of a web-based (or HTML rendering engine-based) aggregator.  I’m thinking here about how an attacker could target a specific aggregator (perhaps with user-agent sniffing in PHP, or .htaccess rules) and serve maliciously-styled content to that aggregator, while other agents get normal (or no) styles.  Not that anyone would be inclined to do such a thing, but let’s hypothesize.

You’ve already mentioned position: absolute, but some others come to mind:

etc.

Anyway, I just don’t see how an aggregator can expect to distinguish between good styles and annoying-to-the-point-of-generating-support-requests styles.  Maybe if they piggybacked on a full CSS parser (Sage could probably do this, since it could probably QueryInterface into the CSS parsing code in Firefox and parse stuff manually, then check the parsed CSSRuleSet or something).  Anyway, I don’t ever see this being worthwhile from a cost-benefit POV, especially against a targeted attack from a disgruntled producer.

Posted by Mark at

What I did with Sporkfed/FeedTools was to add per-feed configuration support to FeedTools, then allow the user to enable/disable HTMLTidy, sanitization, and CSS styling for each feed.  Default is obviously to have sanitization and CSS off.  Certainly though, another option for allowing clearly unmalicious styling through would be nice.

Posted by Bob Aman at

let’s hypothesize

My problem is that nothing you described couldn’t be done with font tags, <pre>, and a few thousand newlines.  Even fewer if you pick a really big font.

Heck, right now I’m getting a bit annoyed with the <br clear="all" /> that appears in Boing Boing’s feed.

And there already is the possibility of class name collisions...

Maybe if they piggybacked on a full CSS parser

In addition to the regular expression mentioned above, all the valid styles I have seen also conform to the following:

^(\s*[-\w]+:\s*[^:;]*(;|$))*$

Combined, this means that everything is property:value, where all quotes and parens are correctly paired, and never are nested.
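A short sketch of what the structural expression accepts and rejects, assuming Python's re module; the sample values are made up for illustration apart from the !important one, which mirrors a style seen in the wild.

```python
import re

# The structural expression: a sequence of property:value pairs
structure = re.compile(r'^(\s*[-\w]+:\s*[^:;]*(;|$))*$')

def is_property_list(style):
    """Return True if the style value is a plain list of property:value pairs."""
    return structure.match(style) is not None

print(is_property_list("color: red; font-size:85%"))                      # True
print(is_property_list("border:none !important; margin:0px !important;"))  # True
print(is_property_list("color:red:blue"))                                  # False, second colon in a value
print(is_property_list("color"))                                           # False, no value at all
```

The value color:red:blue would pass a character-only check, so the two expressions catch different problems and are meant to be applied together.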

I’m thinking that the UFP could permit the caller to define a set of filters... one set is passed elements, and another is passed individual CSS properties.  Filters could return back the original content, modified content, or None.

This would allow callers to strip class names that are problematic for their site, and perhaps even to modify or eliminate tags that are not consistent with their page’s DOCTYPE.

In fact, the current set of sanitation rules could be recast as a default set of filters.
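To make the idea concrete, here is a rough sketch of how per-property filters might be chained. Nothing here is the Universal Feed Parser's API: apply_css_filters, drop_position and lowercase_value are hypothetical names, and the splitting on semicolons assumes the style value has already passed the simple checks above.

```python
def apply_css_filters(style, filters):
    """Pass each property:value pair through every filter in turn.

    A filter takes (prop, value) and returns the pair unchanged,
    a modified pair, or None to drop the declaration.
    """
    kept = []
    for declaration in style.split(';'):
        if ':' not in declaration:
            continue  # skip empty or malformed pieces
        prop, value = declaration.split(':', 1)
        item = (prop.strip().lower(), value.strip())
        for css_filter in filters:
            item = css_filter(*item)
            if item is None:
                break
        if item is not None:
            kept.append('%s: %s' % item)
    return '; '.join(kept)

def drop_position(prop, value):
    # Hypothetical site-specific filter: never allow positioning
    return None if prop == 'position' else (prop, value)

def lowercase_value(prop, value):
    # Hypothetical filter that returns modified content
    return (prop, value.lower())

print(apply_css_filters("TEXT-DECORATION: Line-Through; position: absolute",
                        [drop_position, lowercase_value]))
# text-decoration: line-through
```

A caller with different needs would pass a different list of filters, which is the point of recasting the default rules as just one such list.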

Posted by Sam Ruby at

“My problem is that nothing you described couldn’t be done with font tags, <pre>, and a few thousand newlines.”

How about this?

style="background:white; padding:99999em; margin:-99999em;"

Hypothetically, this would effectively wipe out everything that comes before the malicious <item> in the HTML source.

Add a position:relative;, and even a z-index:9999;, and everything after disappears too.

Posted by Már at

I’ve been talking to Andrew Begun and Joe Friend at Microsoft, and according to Andrew, it looks like a lot of the things we noticed that were wrong with the content on Joe’s site were either the result of Joe’s hand-made changes (such as the use of border=1 on his image elements) or the Community Server munging things after it got the content.  I explained the issue with the <del> element, and it looks like they’ll be updating the code to use the correct element instead.  So in reality, they may actually be doing a better job than we realized.  Kudos to them if that really is the case.

Posted by Bob Aman at

Chasing referers of warning/DangerousStyleAttr.html and rescanning the OPML top 100 produces the following additional use cases:

border:none !important; margin:0px !important;
line-height: normal
overflow: auto
TEXT-INDENT: 0.5in
vertical-align: bottom
vertical-align: top
white-space: nowrap

Noting that the only valid use of hyphens observed so far is in identifiers, a further refinement of the regular expression is possible:

^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$

Posted by Sam Ruby at

Noting that the only valid use of hyphens observed so far is in identifiers

background-repeat: repeat-x;
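For reference, the refined expression can be run against this value and a couple of others. This is only a sketch; the -moz-binding value is invented for the test and is not taken from any feed.

```python
import re

# The refined expression: a hyphen is only accepted between two word characters
refined = re.compile(r'''^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$''')

def passes(style):
    return refined.match(style) is not None

print(passes("background-repeat: repeat-x;"))  # True
print(passes("margin:-99999em;"))              # False
print(passes("-moz-binding: none"))            # False
```

So a hyphen inside a value such as repeat-x is matched the same way as one inside a property name, while a leading hyphen (a negative length or a vendor prefix) is rejected.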

Posted by Mark at

background-repeat: repeat-x;

As distinct from repeat-y, which is a platypus.

Posted by Phil Ringnalda at

The platypus hack was actually neither, which is to say both.  It used background-repeat: repeat, which does not fall afoul of this new rule.  (Other parts of the platypus do, so it would still be blocked overall.  Although I’m pretty sure you could accomplish something just as annoying with only a background color, not a background image, and it would “pass” Sam’s latest filter.)

Posted by Mark at

Not only are urls effectively blocked, so is background-repeat (but not background), and z-index.

However, after a little investigation, whitelisting CSS property values is possible:

import re

acceptable_css_keywords = ['aqua', 'black', 'block', 'blue', 'both', 'bottom',
  'brown', 'center', 'fuchsia', 'gray', 'green', '!important', 'left',
  'lime', 'maroon', 'medium', 'none', 'navy', 'normal', 'nowrap', 'olive',
  'pointer', 'purple', 'red', 'right', 'solid', 'silver', 'teal', 'top',
  'transparent', 'underline', 'white', 'yellow']
# a short number with an optional unit, a hex color, or a piece of an rgb() value
valid_css_values = re.compile(r'''^(\d?\.?\d?\d(cm|em|ex|in|pt|px|%|,|\))?|#[0-9a-f]+|rgb\(\d+,\d*,?\d*\)?)$''')

# the function wrapper and its name are illustrative; the checks are unchanged
def is_acceptable(prop, value):
  # properties whose names start with 'font' are not checked here
  if not prop.lower().startswith('font'):
    for keyword in value.lower().split():
      if keyword not in acceptable_css_keywords and not valid_css_values.match(keyword):
        return False
  return True

Undoubtedly, this list will grow a bit over time (for example, there are all sorts of obscure colors), but the goal isn’t to allow in the obscure or hard to parse styles; the above is sufficient to allow all of the OPML 100 and the 28 Feed Validator DangerousStyleAttr referers through — with the exception of Ray Ozzie’s position:absolute, and various MS-word specific properties.

Posted by Sam Ruby at

Whitelisting colors could drive a man insane.  (Don’t forget flavor.)

Posted by Mark at

Whitelisting colors could drive a man insane.

So... let’s just allow through the base and popular colors.  I have no problem penalizing the obscure or hard to parse.  The current HEAD of UFP will not allow through the following:

style="background: NavajoWhite;"

I’m not proposing to change that, but I am proposing to allow through:

style="background: #ffdead;"
style="background-color: NavajoWhite;"

Also, let’s remember that Platypus was annoying but not evil.  I’m sure that one could be sufficiently annoying with simply the height and width attributes of the image tag.  Allowing JavaScript through, however, would be enabling evil...

Posted by Sam Ruby at

I’m sure this is going to get shot down as being totally stupid, but what about ignoring any style that contains the strings javascript or expression?

Since I have a three pane aggregator (one message viewed at a time) I don’t care about obnoxious HTML like white on white or the platypus hack. I just want to stop people slipping through javascript. Will keyword blacklisting work? Are there other keywords I should check?

I understand that whitelisting would be better, but that would require more parsing than I’m willing to do. At least it’s clear what is being blacklisted with this method - your regular expressions seem too much like magic to me.

I should also point out that Mark’s use of entities to hide stuff won’t be a problem because my HTML parser will have already decoded that. What I’m dealing with is a pure unescaped attribute - I just need to know whether it is safe to accept it or not.

Posted by James Holderness at

If you want to do blacklisting to keep out all scripting, you will need to handle all the different ways that every browser allows JavaScript in attribute values.  These examples are taken from XSS cheat sheet but could easily be adapted for style attributes:

<IMG
SRC
=
"
j
a
v
a
s
c
r
i
p
t
:
a
l
e
r
t
(
'
X
S
S
'
)
"
>

Not to mention...

[IMG STYLE="xss:expr/*XSS*/ession(alert('XSS'))"]

And, of course, any of these things in combination:

exp/*<XSS STYLE='no\xss:noxss("*//*");
xss:&#101;x&#x2F;*XSS*//*/*/pression(alert("XSS"))']

XSS cheat sheet has more, including some NN4-specific attacks, XML data islands, HTML+TIME, UTF-7, and other fun stuff.

I should also note that you should never try to “neuter” the style attribute by stripping out bad words like “expression”, because an attacker can use your algorithm against you:

[xss style="xss:exprEXPRESSIONession(alert('xss'))"]
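Both failure modes are easy to reproduce. The sketch below uses two deliberately naive helpers, contains_bad_word and neuter, which are hypothetical stand-ins for the kind of keyword blacklist being discussed and not anyone's actual code.

```python
import re

def contains_bad_word(style):
    """Naive blacklist: look for known keywords in the raw value."""
    lowered = style.lower()
    return 'expression' in lowered or 'javascript' in lowered

def neuter(style):
    """Naive neutering: delete the bad word and keep the rest."""
    return re.sub(r'expression', '', style, flags=re.IGNORECASE)

# A CSS comment splits the keyword, so the blacklist never sees it
hidden = "xss:expr/*XSS*/ession(alert('XSS'))"
print(contains_bad_word(hidden))   # False

# Deleting the inner keyword assembles the outer one
nested = "xss:exprEXPRESSIONession(alert('xss'))"
print(neuter(nested))              # xss:expression(alert('xss'))
```

Rejecting the whole attribute when something suspicious is found avoids the second problem, though not the first.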


Posted by Mark at

I just took a look at the CSS property table to find properties that I have used in the past that I believe don’t make the current whitelist criteria.  I leave it to you to decide whether any of these should be whitelisted:

border-collapse:collapse
border-style:dotted | dashed | double | groove | ridge | inset | outset
clip: rect(5px, 40px, 45px, 5px);
cursor:auto | crosshair | default | pointer | move | e-resize | ne-resize | nw-resize | n-resize | se-resize | sw-resize | s-resize | w-resize | text | wait | help | progress (I don’t know that there is browser support for ALL of these, but several at least)
display:inline
font-style:italic
font-variant:small-caps
font-weight:bold
list-style-type: disc | circle | square | decimal | decimal-leading-zero | lower-roman | upper-roman | lower-greek | lower-latin | upper-latin | armenian | georgian | lower-alpha | upper-alpha | none (I don’t know that there is browser support for ALL of these, but several at least)
overflow:auto | hidden | scroll
page-break-after:always
page-break-before:always
position:relative
text-decoration:line-through
text-align:justify
vertical-align:baseline | sub | super | text-top | middle | bottom | text-bottom
white-space:pre
z-index:auto (and I’m also pretty sure negative z-indexes are valid)

Posted by Kevin H at

Mark:

These examples are taken from XSS cheat sheet

Nearly all of those examples will be decoded and normalized by an HTML parser to an easy-to-parse form. Remaining trickery can be scrubbed by removing whitespace and such for the purpose of checking. (Best to use a character whitelist containing the kinds of characters required to express malicious intents.) No neutering; just wholesale rejection if a value trips a wire. Since there is no practical use for comments in a style attribute their presence is grounds for immediate rejection too. All these checks should be coarse, since false positives are not an issue.

I’m always skeptical of blacklists, but this approach might work. Scrub vigorously, keep it as dumb as possible, be trigger happy. Besides possible incompleteness of the blacklist there should be no attack vectors.
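A rough sketch of that approach in Python, to show how little code it takes. The word list is assembled from keywords mentioned in this thread and is certainly incomplete; rejecting backslashes (CSS escape sequences) is an extra assumption added here, and TRIPWIRES and trips_wire are made-up names.

```python
import re

# Keywords mentioned in this thread; an assumption, not a vetted list
TRIPWIRES = ('expression', 'javascript', '@import', '-moz-binding')

def trips_wire(style):
    """Return True if the already-decoded style value should be rejected wholesale."""
    # Comments have no practical use in a style attribute; backslash
    # escapes are rejected too (an assumption of this sketch)
    if '/*' in style or '*/' in style or '\\' in style:
        return True
    # Remove all whitespace before checking, then be trigger happy
    squeezed = re.sub(r'\s+', '', style).lower()
    return any(word in squeezed for word in TRIPWIRES)

print(trips_wire("xss:expr/*XSS*/ession(alert('XSS'))"))  # True, contains a comment
print(trips_wire("j a v a s c r i p t:alert(1)"))         # True, once whitespace is removed
print(trips_wire("TEXT-DECORATION: line-through"))        # False
```

Whether such a list can ever be complete is exactly the open question; the sketch only shows the mechanics.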

Posted by Aristotle Pagaltzis at

Besides possible incompleteness of the blacklist

Have you been watching the monthly bits of fun I’ve been having at BlogLines' expense?  What about @import?

Posted by Sam Ruby at

Thanks Mark. That list and that link have been very informative. I never would have suspected you could throw whitespace into the middle of scripting code like that. I still think this could work though, if you went with something like what Aristotle is suggesting. It may not be proof against all possible future attacks, but as long as it can handle all the known holes I’d be happy.

Posted by James Holderness at

Besides possible incompleteness of the blacklist there should be no attack vectors.

Yes, I can’t imagine what could possibly go wrong.

Posted by Mark at

Malcolm Tredinnick on Does CSS belong in RSS feeds? I’m kind of in the “no” camp. However, if I truly stick to my purist principles a conflict arises: I can’t use styles and I would prefer not to use deprecated elements (e.g. strike), so now I can’t strike out text, which is...

Excerpt from JeffCroft.com Comments | Does CSS belong in RSS feeds? at

CSS Security

There is an interesting blog post over at intertwingly.net discussing how to sanitize CSS for the purpose of making it acceptable for blog submissions to avoid cross site scripting. This is an interesting problem I’ve spent weeks on end...

Excerpt from ha.ckers.org web application security lab at

Interesting discussion.  While working on NeatHtml I developed a whitelist regular expression to include in the whitelist XML Schema that NeatHtml uses to validate untrusted HTML.  Here’s what I came up with

^(\s*(
(vertical-align|VERTICAL-ALIGN):\s*((text-)?(top|bottom)|middle|baseline|sub|super)
|(text-align|TEXT-ALIGN):\s*(left|right|center|justify)
|(text-decoration|TEXT-DECORATION):\s*(blink|line-through|overline|underline|none)
|(font-style|FONT-STYLE):\s*(normal|oblique|italic)
|(font-weight|FONT-WEIGHT):\s*(normal|bold)
|((mso-|MSO-)(fareast-|FAREAST-|ansi-|ANSI-|bidi-|BIDI-))?(font-family|FONT-FAMILY|language|LANGUAGE):\s*[a-zA-Z, &quot;'\-]+
|(font-size|FONT-SIZE):\s*(([1-9]|1[0-8])pt|([1-3]|[0-2]\.[0-9]*)e[mx]|xx-small|x-small|small|medium|large)
|(margin|MARGIN|padding|PADDING)(-(top|TOP|left|LEFT|bottom|BOTTOM|right|RIGHT))?:(\s*(0|[1-9][0-9]*)(\.[0-9]*)?(pt|em|ex|in|px|cm)?)+
|(width|WIDTH|height|HEIGHT):\s*(0|[1-9][0-9]*)(\.[0-9]*)?(pt|em|ex|in|px|cm|%)?
|(text-indent|TEXT-INDENT):\s*(0|[1-9][0-9]*)(\.[0-9]*)?(pt|em|ex|in|cm)
|(mso-spacerun|MSO-SPACERUN):\s*(yes|no)
|((background|BACKGROUND)-)?(color|COLOR):\s*([\-a-zA-Z]+|#[0-9a-fA-F]{1,6}|rgb\([0-9, ]+\)))
\s*;?)*$

That should block javascript, the ‘position’ property, negative/huge margins/padding, and huge text, while allowing all the benign styles I’ve come across in posts to my forums.  It doesn’t prevent white-on-white and I really can’t think of a practical way to do that.
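The same per-property idea can also be written as a lookup table rather than one large expression. The sketch below is in Python and covers only a handful of the properties above; RULES, LENGTH and is_whitelisted are names invented for the example, and this is not how NeatHtml itself is implemented.

```python
import re

# A length as in the width/height rule above: a non-negative number, optional unit
LENGTH = r'(0|[1-9][0-9]*)(\.[0-9]*)?(pt|em|ex|in|px|cm|%)?'

# Property name -> pattern that the whole value must match
RULES = {
    'text-align': r'left|right|center|justify',
    'text-decoration': r'blink|line-through|overline|underline|none',
    'font-style': r'normal|oblique|italic',
    'font-weight': r'normal|bold',
    'width': LENGTH,
    'height': LENGTH,
}

def is_whitelisted(style):
    """Return True only if every declaration uses a known property and value."""
    for declaration in style.split(';'):
        if not declaration.strip():
            continue  # tolerate a trailing semicolon
        prop, sep, value = declaration.partition(':')
        pattern = RULES.get(prop.strip().lower())
        if not sep or pattern is None or not re.fullmatch(pattern, value.strip()):
            return False
    return True

print(is_whitelisted("TEXT-ALIGN: center; font-weight: bold"))  # True
print(is_whitelisted("width: 50%"))                             # True
print(is_whitelisted("position: absolute"))                     # False, unknown property
print(is_whitelisted("width: expression(alert(1))"))            # False, value not allowed
```

Unknown properties are rejected by default, so position and anything else not listed never reaches the page.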

As for preventing HTML-based defacement (e.g. <pre> containing many lines or long lines), the best solution I’ve seen is to place the HTML in a <div> that uses CSS to constrain the width and height.  That limits the area that the vandal can damage.  For an example, see the NeatHtml demo.

Posted by Dean Brettle at

Re: Blogging with Style

wearehugh : Re: Blogging with Style Tags : css javascript rss sanitizing security...

Excerpt from HotLinks - Level 1 at
