Whitelisting

From time to time, the question of whether to use whitelists or blacklists comes up.  As an example, when Mark Pilgrim originally wrote How To Consume RSS Safely (way back in 2003!), he described a list of elements that needed to be blacklisted, and mentioned, almost in passing, that whitelisting may be a reasonable alternative.  Over time, Mark came to realize that there really isn’t any contest: a whitelist is the best way to validate input.  It basically comes down to what kinds of errors you are willing to tolerate.

Another context that this comes up in is the Feed Validator.  Originally, the Feed Validator employed a mix of strategies, but over time, I’ve been converting each and every one over to a white list strategy.

Why?  I can’t begin to list all the possible misspellings of isPermaLink, nor all of the possible places where itunes:category can not be placed.
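To make the contrast concrete, here is a minimal sketch in Python.  The names KNOWN_GUID_ATTRIBUTES, BLACKLISTED_MISSPELLINGS, whitelist_errors and blacklist_errors are hypothetical and are not taken from the Feed Validator source.  The only point is that a blacklist catches the misspellings somebody thought of in advance, while a whitelist catches all of them.

```python
# Hypothetical illustration; not the Feed Validator's actual code.

# Whitelist: the attributes that RSS 2.0 defines for <guid>.
KNOWN_GUID_ATTRIBUTES = {"isPermaLink"}

# Blacklist: misspellings somebody happened to think of in advance.
BLACKLISTED_MISSPELLINGS = {"ispermalink", "isPermalink", "IsPermaLink"}


def whitelist_errors(attributes):
    """Flag every attribute that is not explicitly known to be valid."""
    return sorted(name for name in attributes if name not in KNOWN_GUID_ATTRIBUTES)


def blacklist_errors(attributes):
    """Flag only the attributes that are explicitly known to be wrong."""
    return sorted(name for name in attributes if name in BLACKLISTED_MISSPELLINGS)


if __name__ == "__main__":
    seen = ["isPermaLink", "isPermalink", "is_permalink", "isPermaLnk"]
    print("whitelist flags:", whitelist_errors(seen))
    print("blacklist flags:", blacklist_errors(seen))
```

With this input the whitelist flags all three misspellings, while the blacklist flags only the one it already knew about.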

This does mean that from time to time, people will notice that false positives do occur.  All I can say when such happens is that I’m sorry, and I will try to be responsive when people provide specific use cases.  But even then, every effort will be made to simply whitelist just those use cases and nothing more.  A relatively recent example involved the use of specific rdf elements in the context of an Atom feed.  This came up in September, and again this month.


First time I’ve seen RDF used in Atom like this, although it’s very much the kind of thing I personally had in mind when Atom extensions were under discussion (for my own stuff I wound up thinking more about RDF as Atom content, or just using GRDDL on the Atom doc as a whole).

Anyhow, quick question: the construct here looks a bit unusual:

<rdf:type>info:some-ns/type-for-feed</rdf:type>

I’m curious, would the RDF/XML style be legit?

<rdf:type rdf:resource="info:some-ns/type-for-feed" />

(RDF/XML would otherwise treat the object as a literal string)

On the general topic of whitelisting, there’s FOAFWhitelisting and FoafOpenid - ideas whose time has come :-)

Posted by Danny at

And we see it again here.

When we’re talking about extension elements, I’d strongly recommend making this a warning instead of an error, call it “Questionable Use of Extension Element” with the following description:

Your feed contains an element that this validator does not recognize in this context... This may just be a typo. Element names are case-sensitive; make sure you’re using the right case. For example, pubDate has a capital “D”... This may simply be a case of an element being placed in the wrong context or apparently being used for purposes other than what it was originally intended. For example: itunes:category can only be placed inside of the channel element; it does not belong inside an item.

(note that this does not cover trying to validate something that is not a feed, which the current error covers)

Posted by James Snell at

James, we’ve had this discussion before, and undoubtedly we will have it again.

From my point of view, every element in an Atom feed is in a namespace, and the use and possible position of that element is defined by the appropriate spec.  And while, no, I can’t rule out a future RFC 4685bis defining a “thr:when” attribute, I can say that the current one doesn’t, and comfortably flag any such attributes as an error.  Even though it is an extension attribute.

If/when such a RFC 4685bis does materialize, the appropriate additions to the whitelist will be made.

If you review the feed validator mailing list, you will see that not everybody who participates has the same level of understanding of the issues as you do.  An all too common question is “how do I fix a 404 error?”.  To explain to such a person that absolutely nothing in the atom namespace is technically an error as the authors of the spec reserve the right to define future additions to this namespace doesn’t do anybody any good.  No matter how “strongly” or how often you have made this recommendation to me.

For that matter, an RFC 4287bis could very well define additional root elements.  Whee!  Everything is valid.  Nobody goes home without a trophy.  Not.

Nor do I want to stop flagging the improper placement of itunes:category elements (probably the most significant contributor to what still is the fifth most common feed validator reported error) simply because you “strongly” recommended it.

I will again encourage you to find specific, and real world, usages which the feed validator flags inappropriately.  Last night, I added a check to cover your absurd soap body example.  I’m thankful that I have a large test suite as my first attempt to fix this would have caused a message to be produced on virtually every RSS 1.0 feed.

Posted by Sam Ruby at

The point of the soap:body example was to demonstrate a single point: the validity of a RFC4287 document does not depend on the currently correct usage of extension elements.  If you wish to validate extension elements in addition to the feed, flag errors relating to those extensions separately from validation errors relating to the feed.  I’m certain there are simple ways to tell your users that while the feed itself may be valid, the particular use of any given extension may not be valid or is, at the very least, questionable.  As it stands now, the current validator output causes confusion by leading users to believe that perfectly valid feeds are not valid due simply to the fact that extension elements are being used in a way the specs allow but the validator developers had not anticipated.

The other example I gave in my soap:body post demonstrates a specific real world use case: using the app:categories element within an atom:feed to point to an Atompub Categories document.  The element is used correctly, and it appears in a location that is perfectly valid according to RFC4287.  RFC5023 does not forbid that element from being reused in other contexts.  The validator should not be flagging its use within atom:feed as an error.  It’s really no different than reusing the atom:link element in other contexts (e.g. an RSS feed).  Flagging such things as errors will prematurely restrict the adoption of potentially useful serendipitous reuse of existing extensions.

Posted by James Snell at

Arguing that a dc:date that is encoded as RFC 822 should not make an RSS 1.0 feed be marked as invalid because that particular feed format chose to focus on a small core and a rich ecosystem of extensions kinda misses the point.  This particular check was in the feed validator from the first day it was deployed.

When people point out real world, non fabricated, not hypothetical, and deployed use cases, those specific use cases are quickly accommodated.  I can point to several documents which describe how to use atom:link in the context of RSS 2.0, and to a considerably larger body of existing feeds, and even to a few consumers that actually take this information into account.

If you can do the same for app:categories, then fine.  But bringing up an absurd usage of soap:Body doesn’t actually help your case.

Posted by Sam Ruby at

There are several thousand Atom feed documents currently published on IBM’s intranet that contain app:categories elements.  Does that count as "real-world, non fabricated, not hypothetical and deployed"?

If an extension element is being used incorrectly based on some standard definition or well documented and commonly understood best practice, then by all means signal an error.  If, however, you cannot point to any spec text or documented best practice anywhere that says that a particular, unexpected use of any given element is clearly wrong, then the validator should not signal an error; a warning is fine, but it’s certainly not an error.

Posted by James Snell at

There are several thousand Atom feed documents currently published on IBM’s intranet that contain app:categories elements.  Does that count as "real-world, non fabricated, not hypothetical and deployed"?

Perhaps.  The first I heard of it was when I was forwarded an email by you that included someone who was saying that that usage seemed to be somewhat outside of what the spec intended, and you alone defending your rather unusual perspective.  When my initial response was to agree with the person who read the spec differently than you did, the next thing I knew you went public with an absurd example.

If an extension element is being used incorrectly based on some standard definition or well documented and commonly understood best practice, then by all means signal an error.  If, however, you cannot point to any spec text or documented best practice anywhere that says that a particular, unexpected use of any given element is clearly wrong, then the validator should not signal an error; a warning is fine, but it’s certainly not an error.

Please read the body of this post.  I readily will agree that employing the use of white lists involves a tradeoff.  One that on balance trades off a few false positives for a more robust approach to validation.  I do not agree with your premise that all that should be thrown away lightly.

Posted by Sam Ruby at

When my initial response was to agree with the person who read the spec differently than you did, the next thing I knew you went public with an absurd example.

You make that sound like a bad thing.  The reason for posting publicly was to see if others felt the same way about it... that is, I specifically wanted to solicit the opinion of a broader community. 

The absurd example was used solely to demonstrate the point that the validity of a feed per RFC4287 has absolutely nothing to do with the definition or utility of the extension elements I may choose to include in my feed.  The feed may be silly and useless, but it’s still valid. If the validator chooses to support the validation of certain extensions, the warnings and errors related to those should be kept separate from the warnings and errors relating to the validity of the feed.

Please read the body of this post.  I readily will agree that employing the use of white lists involves a tradeoff.  One that on balance trades off a few false positives for a more robust approach to validation.  I do not agree with your premise that all that should be thrown away lightly.

No one is saying that anything should be thrown away.  What I’m saying is that a number of questionably-defensible error conditions should be changed to warnings.  What I’m saying is that if you’re going to presume to validate a feed based on specs and documented best practices, there ought to be actual spec language and documented best practices to back it up.

Posted by James Snell at

James:

So a feed with an atom:LiNK element should pass validation without errors and at most a warning? Whom will that help?

Posted by Aristotle Pagaltzis at

Aristotle: That’s not what I said. The Atom namespace is well documented and does not include an element called “atom:LiNK”.  That would be an obvious error that is well supported by existing spec language.  What I am saying is that it should not be an error to use known elements (e.g. app:categories) in new and undocumented ways when the specification of those elements does not explicitly rule out such use.

Posted by James Snell at

For that matter, an RFC 4287bis could very well define additional root elements.  Whee!  Everything is valid.  Nobody goes home without a trophy.  Not.

That reminds me, I need to dig up and dust off the source to my OPML validator.  It was... concise.  Didn’t have a cool trophy icon, though.  Sounds like a good LazyWeb project.

Posted by Mark at

That reminds me, I need to dig up and dust off the source to my OPML validator.

What’s wrong with the one built into the feed validator?

Posted by James Holderness at

That’s not what I said. The Atom namespace is well documented and does not include an element called “atom:LiNK”.

No, last month it was atom:foo.

Posted by Sam Ruby at


What’s wrong with the one built into the feed validator?

You mean the Ruby feed validator?

Posted by Mark at

You mean the Ruby feed validator?

I linked to it - how much clearer could I be? Considering your name is listed in the copyright at the bottom of the page, I would have thought you had some idea of its existence.

Posted by James Holderness at

No, last month it was atom:foo.

And your point is? I’ve been very consistent about what I think is and isn’t valid and how I believe the validator should be handling extensions. In each case where this has come up, I’ve been able to demonstrate a real use case and back it up with spec text or precedent.  Where is the spec text that says it is invalid to use app:categories as an extension element within an atom:feed element (I’d even settle for a documented best practice). 

I am trying to understand why it would be unreasonable for the validator to issue a warning rather than an error when a known extension element is used in any context other than what it was originally intended when the specs for those extensions do not implicitly or explicitly rule out such use.

Posted by James Snell at

Where is the spec text that says it is invalid

It would be helpful to this discussion if you could demonstrate that you understand the concept of a whitelist.

Posted by Sam Ruby at

I linked to it - how much clearer could I be?

It would be helpful to this discussion if you could demonstrate a sense of humor.

Posted by Mark at

It would be helpful to this discussion if you could demonstrate that you understand the concept of a whitelist

Heh... it would be more helpful if you’d just answer the question. 

Perhaps if I asked the question differently it would help: Regardless of the method you are using to validate feeds, why is it ok for the feed validator to say that a perfectly valid feed is invalid simply because you do not agree with how a particular extension element is being used?

Posted by James Snell at

It would be helpful to this discussion if you could demonstrate a sense of humor.

That would be easier if you were funny.

Posted by James Holderness at

Regardless of the method you are using to validate feeds, why is it ok for the feed validator to say that a perfectly valid feed is invalid simply because you do not agree with how a particular extension element is being used?

Sigh.  Look up the definition of false positive.  I’ve used it several times.

The very same line of code that produces valuable feedback on misplaced itunes categories also provides incorrect feedback sometimes.  A whitelist of elements and locations fixes those specific problems.  An answer of all elements everywhere misses the point, as it suppresses valuable feedback.  Trying to patch that approach with a blacklist is a dead end, as it requires you to enumerate all possible misspellings.  The feed validator has a lot of test cases, but nowhere near enough to support such an approach.
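A whitelist of elements and locations can be sketched roughly as follows.  This is an illustration only, not the feed validator’s actual code: the ALLOWED table, the check function and the sample feed are all made up for the example, and a real table would be far larger.

```python
# Hypothetical illustration; not the Feed Validator's actual code.
import xml.etree.ElementTree as ET

ITUNES = "http://www.itunes.com/dtds/podcast-1.0.dtd"

# Whitelist of (parent, child) pairs.  Anything not listed is reported.
ALLOWED = {
    ("rss", "channel"),
    ("channel", "title"),
    ("channel", "item"),
    ("channel", "{%s}category" % ITUNES),
    ("item", "title"),
    ("item", "pubDate"),
}


def check(element, problems=None):
    """Walk the tree and report every child found outside the whitelist."""
    if problems is None:
        problems = []
    for child in element:
        if (element.tag, child.tag) not in ALLOWED:
            problems.append("%s not allowed in %s" % (child.tag, element.tag))
        check(child, problems)
    return problems


FEED = """<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <title>Example</title>
    <itunes:category text="Technology"/>
    <item>
      <title>Episode 1</title>
      <pubdate>Fri, 16 Nov 2007 00:00:00 GMT</pubdate>
      <itunes:category text="Technology"/>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for problem in check(ET.fromstring(FEED)):
        print(problem)
```

The same lookup reports both the misspelled pubdate and the misplaced itunes:category, without either having been anticipated individually.  A false positive is fixed by adding one pair to the table.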

Posted by Sam Ruby at

The very same line of code that produces valuable feedback on misplaced itunes categories also provides incorrect feedback sometimes.  A whitelist of elements and locations fixes those specific problems.  An answer of all elements everywhere misses the point, as it suppresses valuable feedback.  Trying to patch that approach with a blacklist is a dead end, as it requires you to enumerate all possible misspellings.  The feed validator has a lot of test cases, but nowhere near enough to support such an approach.

It seems to me that when a known extension element is used in an unexpected yet valid way, a warning would be more appropriate than an error in that it provides valuable feedback AND avoids the false positive.

Posted by James Snell at

It would be helpful to this discussion if you could demonstrate a sense of humor.

That would be easier if you were funny.

Hey.. if you two are going to start getting snippy with one another, at least have the courtesy to stay on topic :-)

Posted by James Snell at

The very same line of code that produces valuable feedback on misplaced itunes categories also provides incorrect feedback sometimes.

I think the point JamesS is trying to make, is that when there’s a chance of incorrect feedback, the feed validator should be giving a warning rather than an error. I get where he’s coming from, and probably have argued that myself in the past, but I’ve come around to your way of thinking. Especially with the warning having been toned down to a recommendation (which I think is good), something like a misplaced itunes category is too important to just be warned. And, as you say, when there’s a false positive (from a real world use case) you can whitelist it.

Posted by James Holderness at

JamesH: iTunes is very specific on where in the feed its elements can be used and the meaning/usefulness of those tags within a feed is well-established.  I would fully expect that any validator that claims to comprehend the itunes namespace would signal an error when dealing with an out of place itunes:category element.  However, RFC5023 is not as explicit about where the app:categories element can be used.  It defines two locations where it is meaningful within the context of Atom Service Documents and says absolutely nothing about its use in RFC4287 documents.  The most a validator claiming to comprehend the Atompub namespace can reasonably do is signal a warning when the app:categories element is used within an atom:feed.

Posted by James Snell at

FWIW, here’s another example that is based on a “real-world, non fabricated, not hypothetical and deployed” use case.  Specifically, the Lotus Connections Activities component implements the notion of a “Collection of Collections”.  Within the top level collection, each entry represents a sub-collection and contains a corresponding app:collection element.

Posted by James Snell at

And another... this one also based on a “real-world, non fabricated, not hypothetical” approach we are currently exploring as a solution to the problem that Atom Service Documents have no means of uniquely identifying Atompub collections or differentiating between different kinds of Atompub collections (e.g. a service document may have one workspace with a collection used to manage blog instances along with one workspace per individual blog instance).  Unfortunately, the FeedValidator’s whitelist is incorrectly claiming that this solution is invalid.

Oh, and as a side note: it appears that the validator may be having problems on service documents served with the proper application/atomsvc+xml media type.  When I attempt to serve up the document using the proper media type, the validator complains that it can’t locate the file.

Posted by James Snell at

JamesS: I can’t really comment on your specific case since I know almost nothing about atompub elements and how they’re being used (or are intended to be used) - I stopped following atompub some time ago. I’m just saying that, in general, I think it’s perfectly reasonable for the feed validator to mark everything as invalid that doesn’t have a known valid use case. That’s the whole point of whitelisting. When in doubt, assume the worst - and in this case the worst means invalid.

If you want to argue that your particular usage should be added to the whitelist, that’s a different issue (which as I say I can’t comment on).

Sam: I just noticed that the fragment part of the urls in your comment feed all have an extraneous “.0” on the end at the moment. Looking at past comments, it seems to have started sometime around November 6th.

Posted by James Holderness at

When in doubt, assume the worst - and in this case the worst means invalid.

And why would a “Questionable use of a Known Extension” warning not also be appropriate?

Posted by James Snell at

Why is a validator expected to proclaim valid extensions it does not support? Isn’t acting as a white list what a validator is all about?

Of course, one might argue that people get scared of extensions if they are proclaimed invalid. But wouldn’t it be better if writers of extension specs (we want there to be specs for them, right?) coordinated with validator developers to permit the extensions from Day One of each extension?

That is, isn’t it better to make the process of amending the white list as low-barrier as possible instead of letting extensions pass by putting them into the unchecked space of a black list?

FWIW, the way I try to tackle the issue in Validator.nu is allowing user-provided schemas, so users can use their extended copies of schemas. This way the validator doesn’t let typos go unnoticed but allows users who know what they are doing to punch the holes they want.

Posted by Henri Sivonen at

I’ve come around to your way of thinking

Thanks.  It is worth noting that it took quite a lengthy period of time for me to come around to that way of thinking.  Time looking at a lot of feeds, both buggy and non-buggy.  Time listening to the questions, comments, and complaints that have shown up on the feed validator mailing list, and on other feed-related lists (like rss-public).

here’s another example ... and another

Both test cases have been added (entry-with-collection, service-with-id) and fixes have been made and deployed.  The fixes included not only adding these elements, but also adding supporting infrastructure and fixing latent bugs in the feed validator itself necessary to make this work.  I’m not suggesting that it was hard, just that it wasn’t automatic or free.

it appears that the validator may be having problems on service documents served with the proper application/atomsvc+xml media type.

Fixed.  Thanks!

I just noticed that the fragment part of the urls in your comment feed all have an extraneous “.0” on the end at the moment.

Fixed.  Thanks!

And why would a “Questionable use of a Known Extension” warning not also be appropriate?

While the current feedvalidator source does have a list of known namespaces, it does not have a centralized list of known elements.  More importantly, most elements have a lot of implicit semantics, and that requires some code.  A few examples from just the two test cases you provided: I’m assuming that a workspace can have multiple categories but can only have one id.  And that categories can have a term attribute but an id can not.  And that ids can’t be duplicated in a service document (whereas they can in an atom feed).  Hardcoded knowledge such as this allows the feedvalidator to produce error messages such as this one.
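As a rough sketch of what such hardcoded knowledge might look like (this is not the feedvalidator’s code; the function name check_service, the messages and the sample document are hypothetical, and the placement of atom:id directly inside a workspace is itself an assumption):

```python
# Hypothetical illustration of the assumptions listed above; not the
# Feed Validator's actual code.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
APP = "{http://www.w3.org/2007/app}"


def check_service(root):
    """Return a list of messages for a parsed Atompub service document."""
    problems = []
    seen_ids = set()
    for workspace in root.iter(APP + "workspace"):
        ids = workspace.findall(ATOM + "id")
        # Assumption: any number of categories, but at most one id.
        if len(ids) > 1:
            problems.append("workspace contains more than one atom:id")
        for id_element in ids:
            # Assumption: a category may carry a term attribute, an id may not.
            if "term" in id_element.attrib:
                problems.append("atom:id must not have a term attribute")
            value = (id_element.text or "").strip()
            # Assumption: ids are unique across the service document.
            if value in seen_ids:
                problems.append("duplicate atom:id: " + value)
            seen_ids.add(value)
    return problems


SERVICE = """<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>Blogs</atom:title>
    <atom:id>urn:example:workspace</atom:id>
    <atom:category term="blogs"/>
    <atom:category term="admin"/>
  </workspace>
  <workspace>
    <atom:title>One blog</atom:title>
    <atom:id term="oops">urn:example:workspace</atom:id>
  </workspace>
</service>"""

if __name__ == "__main__":
    for problem in check_service(ET.fromstring(SERVICE)):
        print(problem)
```

Each rule is a few lines, but none of them falls out of a simple list of element names, which is why adding a new element to the whitelist is not automatic or free.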

Normally, the above can be discussed and consensus can be reached before anything is deployed in the feed validator.  Ideally, this discussion would take place in a mailing list like atom-syntax where others may participate.  It boggles my mind that Lotus Connections has deployed (at least internally) without that discussion taking place.

In any case, my point here is that the checks that the feed validator makes are context dependent.  Atom elements in Atom feeds have a certain meaning.  Atom elements in an RSS feed tend to enforce less semantics.  Atom elements in other contexts may have more or less semantics, and the only way to determine that would be to look at actual use cases.

Posted by Sam Ruby at

Henri: Why is a validator expected to proclaim valid extensions it does not support?

Never said it was.

Sam: "More importantly, most elements have a lot of implicit semantics, and that requires some code.  A few examples from just the two test cases you provided: I’m assuming that a workspace can have multiple categories but can only have one id.  And that categories can have a term attribute but an id can not.  And that ids can’t be duplicated in a service document (whereas they can in an atom feed).  Hardcoded knowledge such as this..."

While I appreciate you fixing the validator to address the false positives I posted, I never asked or suggested that the validator needs to be able to validate the use of elements like atom:id when they’re unexpectedly used as extensions in other contexts. In the absence of clear public documentation, the most the validator should be doing is returning a “Questionable Use” warning.

Sam: "Ideally, this discussion would take place in a mailing list like atom-syntax where others may participate.  It boggles my mind that Lotus Connections has deployed (at least internally) without that discussion taking place."

I wasn’t aware that implementors had to get prior approval from the mailing list to deploy new solutions; I had assumed that conforming to the relevant specifications and documented best practices would be enough but, hey, I guess not.  And, FWIW, I’ve discussed the use of atom:id’s in service documents and app:collection’s in entries on several occasions on the atompub mailing list.

Sam: "In any case, my point here is that the checks that the feed validator makes are context dependent.  Atom elements in Atom feeds have a certain meaning.  Atom elements in an RSS feed tend to enforce less semantics.  Atom elements in other contexts may have more or less semantics, and the only way to determine that would be to look at actual use cases."

Once again I have to ask: why would a “Questionable Use” warning not be appropriate?  Given a lack of context, when you encounter a known element in an unexpected location and you have no idea why it is there, the only test you can reasonably fall back on is a) whether the container element allows it to be there and b) whether the definition of the element explicitly rules out such use.  If it passes either of those checks, issue a “Questionable Use” warning and move on to the next item.  From an implementation point of view, I cannot see how issuing such a warning would be difficult to do.

Posted by James Snell at

I had assumed that conforming to the relevant specifications and documented best practices would be enough but, hey, I guess not.

The discussion ends here.

Posted by Sam Ruby at

Henri: Why is a validator expected to proclaim valid extensions it does not support?

Never said it was.

I disagree.

James Snell: RFC4287 is very clear about the fact that any namespaced elements are allowed as extensions within atom:feed. So is the Feed Validator right to mark the feed invalid? I don’t think so.  (emphasis added)

It is clear that James Snell isn’t reading (or comprehending) what Henri, JamesH, or myself are writing, nor has he been “very consistent about what I think is and isn’t valid and how I believe the validator should be handling extensions”.

Posted by Sam Ruby at

The discussion ends here.

Heh, I guess that’s easier than answering the question.

Posted by James Snell at

I guess that’s easier than answering the question.

I count at least seven attempts to answer that question in the text above.  Adding an eighth (or ninth?  or tenth?  I lost count) seems pointless.

I now realize that my sentence was ambiguous.  It was not meant as an imperative, but merely as a statement or an observation.  Your lapsing into sarcasm and injecting sentiments that weren’t expressed merely for the emotional impact that such would create is somewhat less than constructive.  There clearly is no discussion going on here, what there is is a soliloquy.

Here’s an offer.  I will mark these specific additions as questionable if that is what you wish.  Not because they are “unrecognized” and certainly I will not downgrade the error of placing itunes:category to warning as you originally suggested, but because these usages are recognized as questionable.

Posted by Sam Ruby at

Shades of Gray

The key lesson from this discussion appears to be this: The FeedValidator will, on occasion, indicate that perfectly valid feeds are invalid for no reason that can be explained by looking at the relevant specs or documentation. Rather than fixing...

Excerpt from snellspace.com at


The following comment is awaiting moderation on James Snell’s blog:

I’ve thought about it overnight. There are vocabularies like SSE and iTunes that are very context specific. There are vocabularies like Dublin Core (and perhaps large portions of Atom) that are fairly generic.

A constructive way to contribute would be to suggest a list of element names that would be whitelisted for generic usage. When elements in this whitelist are found in unexpected locations a “Questionable Use of Extension Element” warning would be generated. Upon demonstration of a real and public use, coupled with even the most minimal amount of documentation, the warning on this specific usage would be eliminated. My intent continues to be to make this a low bar to encourage reuse, and the warning itself would reflect this.

Elements that are in completely unknown vocabularies will generate a different warning.

Elements that are in “known” vocabularies but are inappropriately included in or excluded from the whitelist would be treated as a simple FeedValidator bug.
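The tiers described above could be sketched along these lines.  The names KNOWN_NAMESPACES, KNOWN_LOCATIONS, GENERIC_ELEMENTS and classify are hypothetical, the sample entries are placeholders rather than the real seed list, and the message strings are assumptions apart from “questionable use of extension element”.

```python
# Hypothetical illustration of the proposal above; the names and the sample
# entries are assumptions, not the Feed Validator's actual data.
DC = "http://purl.org/dc/elements/1.1/"
ITUNES = "http://www.itunes.com/dtds/podcast-1.0.dtd"
KNOWN_NAMESPACES = {DC, ITUNES}

# (parent, namespace, name) triples with a demonstrated, documented use.
KNOWN_LOCATIONS = {
    ("channel", ITUNES, "category"),
    ("item", DC, "creator"),
}

# Elements considered generic enough to be reused in other places.
GENERIC_ELEMENTS = {(DC, "creator"), (DC, "date")}


def classify(parent, namespace, name):
    """Decide how an extension element found inside `parent` is reported."""
    if namespace not in KNOWN_NAMESPACES:
        return "warning: element in unknown namespace"
    if (parent, namespace, name) in KNOWN_LOCATIONS:
        return "ok"
    if (namespace, name) in GENERIC_ELEMENTS:
        return "warning: questionable use of extension element"
    return "error: undefined or misplaced element"


if __name__ == "__main__":
    print(classify("channel", ITUNES, "category"))
    print(classify("item", ITUNES, "category"))
    print(classify("channel", DC, "creator"))
    print(classify("item", "http://example.com/ns", "foo"))
```

Under this sketch a misplaced itunes:category stays an error, a generic element in a new location gets the softer warning, and demonstrating a real use moves an entry into KNOWN_LOCATIONS.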

The current list of known namespaces can be found here:

http://feedvalidator.org/docs/howto/declare_namespaces.html

Care to produce the initial seed for this list?

Posted by Sam Ruby at

Seed list provided.

Posted by James Snell at


Completely off-topic, but search on the weblog software is broken and giving a 404.

Posted by Keith Gaughan at

Keith: it didn’t find what you were searching for and you therefore proclaim it to be broken?

Posted by Sam Ruby at


Not quite. What I’m complaining about is that when I search for something on the weblog which isn’t there, I should be told that the search failed (and given a 404) rather than being shown an unfriendly page with the message “The requested URL /blog/?q=foobarbaz was not found on this server”. The search interface is broken because it doesn’t handle the failure case properly. Compare a failing search here with a gracefully failing search.

Posted by Keith Gaughan at

As a post scriptum, sorry for not explaining what I meant more clearly originally. I was a little surprised by the stark Apache 404 page and expected something that looked more like the rest of the site, with an explanation that the search found nothing.

Posted by Keith Gaughan at


Argh! I should’ve expected my comment would cause my example to invalidate itself. Try this instead: [link]

The word given in the query means ‘test’ in Irish, and was something I expected wouldn’t be found.

Posted by Keith Gaughan at

Confession: that’s not a “stark Apache 404”, but a carefully crafted facsimile.

Posted by Sam Ruby at

Ah, but still, it’s a rather stark and unfriendly carefully crafted facsimile. :-)

Seriously though, why did you choose to do it that way rather than something somewhat more user-friendly? The first thing that popped into my head when I saw it was that I’d gone and done something wrong somehow.

Posted by Keith Gaughan at

pmuellr: Would be way less funny if they didn't both work at IBM: http://intertwingly.net/blog/2007/11/16/Whitelisting

Excerpt from Twitter / pmuellr at


why did you choose to do it that way rather than something somewhat more user-friendly?

Mostly because I was solving another problem at the time (I don’t recall which one, one of the problems I had was crawlers chasing archives to the ends of time — in both directions), and then simply reused the technique in another situation.

I also have a tendency to think of the features on my weblog as only things I would use.

I’ll try to take a look into this by the weekend.

Posted by Sam Ruby at
