It’s just data

Validate on subscription?

I've thought about Brent's proposed compromise, and to borrow a phrase that is a favorite of Tim Bray, I think that there is a way that 80% of the value can be obtained with 20% of the effort.  Is there really a market requirement to be selectively pedantic on a feed-by-feed basis?

It seems to me that there are two levels of errors: unrecoverable and recoverable.  An HTTP status code of 404 is something that the aggregator cannot work around.  On the other hand, a malformed date may marginally reduce the user's experience, but arguably should not prevent the user from seeing what other data can be salvaged from the feed.

Unrecoverable errors, by necessity, need to be handled each time a feed is retrieved, but do recoverable errors need to be reported each time such an error is encountered?  I mean, do thousands of people need to be alerted whenever a stray smart quote appears on boingboing?

Strictly from an engineering point of view, is that the right design for a feedback loop?  My experience is that, in addition to alerting the wrong person, an overabundance of such alerts tends to dull the message.  People will simply tune them out.

An alternative might be to validate only on subscription.  This would certainly reduce the number of such messages.  It would also present such messages to users at a time when they might expect feedback.

I would also suggest that all such messages be oriented to their target audience.  If a feed contains encoding errors, let the user know that some characters may not appear as intended.  If the feed is missing a required element, tell the user what they will be missing.  If a date is not of the appropriate format, let the user know that such information may be misinterpreted or ignored.

This information could be accompanied by a simple checkbox to inhibit the display of further messages.

Hopefully, such an approach will ultimately result in a more educated consumer base.  A greater demand for higher quality feeds would certainly not be an unwelcome side effect.  It also means that feeds would be sampled regularly.

Parting thought: in my opinion, such checks don't have to be bulletproof, merely effective.  Apply the 80/20 rule here too.  The well-formedness checks provided by your off-the-shelf parser can generally be obtained with a few lines of code.  Ditto for a simple scan for required elements.  I can share the regular expressions used by the feedvalidator.
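
To make that concrete, here is a rough sketch in Python, for illustration only -- this is not the feedvalidator's code, and the required-element list and message wording below are just examples.  It combines a well-formedness check from a stock parser with a simple presence scan for required elements, and phrases the results for the subscriber rather than the developer.

    import xml.sax
    from io import BytesIO

    REQUIRED_CHANNEL_ELEMENTS = ("title", "link", "description")  # illustrative RSS 2.0 subset

    def check_feed(raw_bytes):
        """Return a list of user-oriented messages; an empty list means no complaints."""
        messages = []

        # Well-formedness: let an off-the-shelf parser do all the work.
        try:
            xml.sax.parse(BytesIO(raw_bytes), xml.sax.ContentHandler())
        except xml.sax.SAXParseException as e:
            messages.append("Some characters in this feed may not appear as intended "
                            "(problem near line %d)." % e.getLineNumber())
            return messages  # no point scanning a feed the parser can't read

        # Required elements: a simple presence scan, nothing fancier.
        text = raw_bytes.decode("utf-8", "replace")
        for element in REQUIRED_CHANNEL_ELEMENTS:
            if "<%s>" % element not in text and "<%s " % element not in text:
                messages.append("This feed is missing <%s>; that information "
                                "will not be displayed." % element)
        return messages

The point is only that both checks fit in a screenful of code, and that the messages can be worded for the reader rather than simply echoing the parser.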

However, I do have one suggestion: this should not lead to a practice whereby each consumer documents what subset or superset of the various specifications they support at the moment.  It would be better for all concerned if such checks are made, and errors are reported, in terms of the original specifications.


Sam,

Just expose the feedvalidator as a webservice :) Preferably, something simple and RESTful. Perhaps a GET, with the URL of the feed provided as a parameter and the result an XML doc describing the errors in the feed.
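
For example (purely illustrative -- the endpoint and response format below are invented, not an existing feedvalidator interface), a client call could be as simple as:

    from urllib.parse import urlencode
    from urllib.request import urlopen
    import xml.etree.ElementTree as ET

    SERVICE = "http://validator.example.org/check"   # hypothetical endpoint

    def validate(feed_url):
        """GET the service with ?url=... and return (line, message) pairs."""
        with urlopen("%s?%s" % (SERVICE, urlencode({"url": feed_url}))) as response:
            doc = ET.parse(response)
        # Assumes a response shaped like <errors><error line="...">text</error></errors>
        return [(error.get("line"), error.text) for error in doc.findall("error")]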

Posted by Bo at

Do thousands of people need to be alerted whenever a stray smart quote appears on boingboing?

I think we're approaching this problem from the wrong direction. Instead of trying to make it easier for RSS consumers to validate a feed, we should be making it easier for RSS producers to learn if their feed is invalid.

I believe many people would be happy to fix their feed's errors, but going so far as to stop by feedvalidator every time they make a post is too much trouble. Therefore, if the blogger can't come to the validator, the validator should come to the blogger.

Someone should set up a system where people can sign up to have their blog's validity checked by a parser every n hours, and if there's a problem, they receive an e-mail with the error text explaining how they can fix it.
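
Something along these lines, as a sketch only -- validate() is an assumed helper wrapping whatever validator the service uses, and the addresses are placeholders:

    import smtplib
    import time
    from email.message import EmailMessage

    CHECK_INTERVAL = 6 * 60 * 60   # "every n hours", here n = 6

    def watch(feed_url, owner_email, validate):
        while True:
            errors = validate(feed_url)          # list of (line, message) pairs
            if errors:                           # mail only on failure, never on success
                msg = EmailMessage()
                msg["Subject"] = "Your feed %s did not validate" % feed_url
                msg["From"] = "watcher@example.org"
                msg["To"] = owner_email
                msg.set_content("\n".join("line %s: %s" % pair for pair in errors))
                with smtplib.SMTP("localhost") as smtp:
                    smtp.send_message(msg)
            time.sleep(CHECK_INTERVAL)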

Alternatively, another solution would be to issue some kind of TrackBack ping to the validator upon each post and have that trigger a validity check on the URL sent. However, I'm not quite sure how to best determine an e-mail address to mail the results, given that nobody uses the <webMaster> tag because of spammers.

I'm not sure exactly of all the specifics here, but I'm sure they could be worked out to balance ease-of-use against server-resource issues. I do think it's important that if the feed validates then you don't receive e-mail, because otherwise you'll end up spamming yourself, and it's important for failures to stand out.

Posted by Adam Trachtenberg at

I've been told I'm an aberration often enough about this that I don't expect anyone to build what I want, but I for one would welcome selective alerting: if a member of my tribe slips up, I want an email to them to pop open, so I don't shirk my duty to let them know, but if it's one of my enemies? So be it.

And while I see your point about not selectively reporting missing things, I've spent enough time trying to convince people to add unused elements to their feeds so that aggregator authors would be willing to make use of them that I can also see the value in selectively saying "this feed lacks the <comments> element that would let me put a link to the comments over there on the right, and lacks the <slash:comments> element that would let me tell you how many comments there are already."

Posted by Phil Ringnalda at

Heh. It's twisty, but the Trackback ping to validate might actually work with Movable Type's implementation at least. You can have a URL that you ping for every post in a category, and if you pinged the validator for every category (and categorized everything), then when you pinged from a post that made your feed invalid, the validator could just fail to respond, which would throw an error message in MT, and would tell it that it needed to try pinging again the next time that post was saved. Not exactly a consumer-grade web service, but cute.

Posted by Phil Ringnalda at

It turns out that lots of people use NetNewsWire to monitor their own feeds. They subscribe to their own feeds, and if they stop working in NetNewsWire, then they validate them and fix the bugs.

It's for these people, in part, that I want the ability to be selective about which feeds require well-formed-ness and which don't.

There are also other people who simply care a great deal about this issue, and want to require well-formed-ness for all their subscriptions -- except for, say, BoingBoing, which they know will have stray smart quotes sometimes but they want to read it anyway.

And then of course there are the majority of users who don't and shouldn't care about well-formed-ness. For them, the defaults will work nicely -- NetNewsWire will not require well-formed-ness and will not report well-formed-ness errors. In other words, it will work exactly as it works now.

I can't stress enough that all this well-formed-ness checking and error reporting will be optional, off by default.

Also, I don't plan at all to report bugs like malformed dates. Instead I plan to make the Validate this Feed command more prominent so people will use yours and Mark's on-line validator.

Posted by Brent Simmons at

Format negotiation, internalized and/or requested

I think "negotiation" is an interesting concept to add to this discussion. . . I guess the assumption with RSS is that it is harder to "ask" an RSS server to deal with the format of its RSS than it is to deal with this in the RSS reader. I guess, in some ways, fat clients live on!... [more]

Trackback from the iCite net development blog

at

I had the same reaction to the error reporting part of Brent's piece!

Spring, my universal canvas app for OS X, uses XSLT [1] to convert all incoming XML to our XML format (Conceptual Object format) before converting to a dictionary. So, liberal XML parsing doesn't come without high engineering costs. I assume other tools that rely on XSLT face similar costs.
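
To make the cost concrete, here is a sketch of the shape of such a pipeline (in Python, with lxml standing in for the TestXSLT-based framework; the stylesheet name is made up):

    from lxml import etree

    # The normalizing stylesheet; the filename here is hypothetical.
    to_conceptual_object = etree.XSLT(etree.parse("to-conceptual-object.xsl"))

    def convert(feed_bytes):
        doc = etree.fromstring(feed_bytes)     # raises XMLSyntaxError on tag soup
        return to_conceptual_object(doc)       # XSLT only ever sees well-formed XML

The transform step simply cannot be handed tag soup, so every concession to liberal parsing has to be paid for with a repair pass in front of it.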

Robb

[1] Using a framework from TestXSLT, [link]

Posted by Robb Beal at

If people won't go to the validator

I think there are a lot of people who are willing to do a little bit to improve feed quality, if it's not too hard and they can do it from where they are already.... [more]

Trackback from dive into mark

at

Automatic Feed Validation

The recent discussions about how strict aggregators should be when reading invalid or ill formed feeds (e.g. RSS or Atom) brought to mind an idea: an automatic service that checks your feed for validity and send e-mail whenever it finds...... [more]

Trackback from Cantoni.org

at

Validate on Subscription (or, my turn to compromise...)

Sam Ruby proposes that aggregators validate on subscription, and I have to confess that this makes more sense than my stated position of requiring Atom feeds to be well-formed. What Sam suggests is that aggregators such as FeedDemon inform of......

Excerpt from Nick Bradbury at

Hmm, I agree that the time of subscription is a critical one, but I've a feeling that retreating back to a 'warn only' position defeats the whole object. The dialog box might as well just say: "That was invalid, I don't care [OK]".

"Do thousands of people need to be alerted whenever a stray smart quote appears on boingboing?" Well, yes, if you think boingboing should be publishing valid XML. There should be at most one stray quote, before it gets fixed (following thousands of complaints).

Like Robb, I've got XSLT on my front end (nurse!) and there is a cost associated with tooling for tag soup.

But I think the far greater cost will be in losing the 'default' of validity (XML and Atom) that a more draconian approach would provide.

Assuming that everyone runs scared of valid XML, I think I might still try a 2-tier approach. If the data is valid Atom or RDF, it gets first class treatment. If it says it's Atom but doesn't validate, or is one of the looser RSS specs then the feed URI gets tagged as 'potentially unreliable' and left out of processing where junk might mess things up - indexing etc.
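
As a sketch of the triage only (validate_atom() is an assumed helper, and none of this is real aggregator code):

    import xml.etree.ElementTree as ET

    def triage(feed_uri, raw_bytes, validate_atom):
        """Tag feed_uri with a tier; second-tier feeds get skipped by indexing etc."""
        try:
            doc = ET.fromstring(raw_bytes)
        except ET.ParseError:
            return (feed_uri, "potentially unreliable", None)   # tag soup: second tier

        if doc.tag.endswith("}feed") and not validate_atom(doc):
            return (feed_uri, "potentially unreliable", doc)    # claims Atom, doesn't validate

        return (feed_uri, "first class", doc)                   # full treatment: index, aggregate, etc.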

Posted by Danny at

The "well-formed" terminology makes this look like it's a purely ideological. Call it a "syntax error" and suddenly it doesn't seem petty. Similarly, "improving the well-formed-ness of feeds" isn't what this is about - it's about whether you can use a standard XML parser without your program becoming a second class citizen. I think programs that say there's an error but still continue harm that goal, since they're basically saying "people with inferior software can't read this". On the other hand, I do like Sam's idea of telling the user how it might affect them, which at least partially negates the superiority effect.

Posted by Graham at

Choice language

Incisive comment from Graham re. Atom and client-side handling : The "well-formed" terminology makes this look like it's a purely...... [more]

Trackback from Raw

at

Graham, this week you say you want everyone to be able to use a standard XML parser, but last week you were treating Shrook's draconian Unicode error handling as a bug and "fixing" it by making it more tolerant.  I accept that there are people in the "draconian" camp who will never budge, but please pick a camp and stay there.

Posted by Mark at

It's what's known as a compromise Mark. During early testing too many feeds were failing for Shrook to be a viable product. But I'm not going to go any further. There's a big difference between structural flaws and the encoding not having been labeled. Shrook still does everything with a standard parser - crucially it isn't doing any of its own XML parsing. It just has two lines of code that do a Windows-1252 -> UTF-8 translation and try again. (btw Shrook doesn't have "Draconian Unicode error handling", it was just tripped up by Sam's particular test)
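
In outline it is no more than this (a sketch in Python, not Shrook's actual code):

    import xml.etree.ElementTree as ET

    def parse_with_fallback(raw_bytes):
        try:
            return ET.fromstring(raw_bytes)
        except ET.ParseError:
            # Reinterpret as Windows-1252, re-encode as UTF-8, and hand the
            # result back to the very same standard parser.
            return ET.fromstring(raw_bytes.decode("windows-1252").encode("utf-8"))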

Posted by Graham at

Graham, you know where I stand on error recovery.  I think second-guessing the encoding is a great idea, the more guessing the better.  But some in the draconian camp would say that you're rewarding bigots and racists.

Tim Bray: "You need to know what encoding your data is in, so that for example when you see a Euro sign you know enough to emit &#x20ac;, not some Microsoft Code Page byte that's guaranteed not to work on lots of browsers.  This can be tricky.  But the alternative is, you're a parochial bigot. ... If your software can't manage to escape five special characters and fill in end-tags and quote attributes, it's failing to meet such a very low barrier to entry that it's probably pretty lame anyhow.  And if developers are not willing to put in the effort to enable the non-white people of the world to use their software, I don't think [we] should condone or reward them."

Posted by Mark at

Mark, perhaps there aren't exactly two camps.

Posted by Sam Ruby at

Sam, according to the draconians, there are exactly 2 camps: the draconians (those who reject all ill-formed XML) and the tolerants (everyone else).  Either a document is well-formed XML or it's not.  You can't be a little ill-formed, just like you can't be a little pregnant.

The difference comes in how you choose to be tolerant, and here is where people baffle me.  The tolerant camp is a very big tent.  And according to the draconians, the minute you step into the tolerant camp (no matter which door you enter), you're rewarding bigots who hate brown people.  So at that point, why not go all the way?  I mean, why be a little tolerant?

Shrook accepts and displays documents that are not RSS, because they have misrepresented their character encoding.  NetNewsWire accepts and displays documents that are not RSS, because they have unescaped ampersands.  Pretty much everyone accepts and displays documents that are not RSS, because they contain illegal control characters.  RSS Bandit accepts and displays documents that are not RSS, because they have invalid date formats (a validity issue, not a well-formedness issue).

My parser is the sum of all these sins, plus some.

Everybody's tolerant, but we're all tolerant in different ways.  This is, in fact, exactly the nightmare scenario that the draconians envisioned 7 years ago.  Hell, it's exactly the nightmare scenario the tolerants envisioned 7 years ago -- that nobody would be able to stomach "reject on first error" at an application level, so they would play games with their underlying "conforming" XML parsers by second-guessing them and feeding them crap repeatedly until it finally got accepted.

Mandatory draconian error handling hasn't increased interoperability; it's destroyed it.  Nobody can stomach it, so everybody skirts it "just a little" -- each in their own way.

Posted by Mark at

Mark, please define "all the way".  If I find a new date format that the ultra liberal parser's "universal date parsing" does not handle, is that a bug?
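
(For what it's worth, "universal date parsing" in practice means an ever-growing table of formats; the ones below are only examples, not the ultra liberal parser's actual list.)

    from datetime import datetime

    KNOWN_FORMATS = [
        "%a, %d %b %Y %H:%M:%S %Z",   # RFC 822, what RSS 2.0 asks for
        "%Y-%m-%dT%H:%M:%SZ",         # ISO 8601 / W3C-DTF, what Atom asks for
        "%Y-%m-%d %H:%M:%S",          # ...plus whatever else turns up in the wild
    ]

    def parse_date(value):
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(value, fmt)
            except ValueError:
                continue
        return None   # a format nobody has added yet: bug, or a line in the sand?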

From my point of view, well formedness is just one component of validity checking, and by that criteria, everybody is "a little pregnant".  As far as the racial references, Let That Be Your Last Battlefield.

Posted by Sam Ruby at

((Should we encourage the use of tolerant parsers/compilers for programming languages as well?) flamebait)

Posted by Danny at

In private email, Bob DuCharme asked where AtomTidy and RSSTidy are.

Posted by Robb Beal at

re: "If I find a new date format that the ultra liberal parser's "universal date parsing" does not handle, is that a bug?"

Yes.

Posted by Mark at

And if the ultra liberal parser can't find descriptions in feeds like this one, is that a bug?

It is only a matter of time before I find a line.  The domain of invalid feeds has a higher cardinality than the domain of valid feeds.

Posted by Sam Ruby at

Sam, that example is actually one of the reasons I wrote the ultra-liberal feed parser in the first place, because The Register used naked markup like that in their description.  If it can no longer handle naked markup like that, I would classify that as a very serious bug indeed.

I'm working on a test suite for the feed parser, which would hopefully prevent regression bugs like this.

There is no line.

Posted by Mark at

Mark, if you don't like what XML is, why do you put your effort into diluting what XML is (which can be damaging to vocabularies other than Atom) instead of lobbying that Atom not be based on XML 1.0?

Posted by Henri Sivonen at

Quick Links - 2004 01 18

Last update: 18/01/04; 14:15:25 EDT Scripting, Blogging, Softwares... Adam Trachtenberg: Using PHP 5's SimpleXML: SimpleXML is a new and unique feature of PHP 5 that solves these problems by turning an XML document into a data structure you can...

Excerpt from blog.scriptdigital.com at
