The tests generated by xmlfile_test.rb depend on a simple
mock cache of files, served locally. No matter how fast
your network connection is, your local hard drive is faster.
FeedTools should be able to comfortably run a test suite the size
of Feed Parser’s in less time than it currently takes to run
its existing test suite.
With these two files in place, FeedTools can directly make use
of the vast suite of Feed Parser tests. In fact, these tests
already pass.
My hope is twofold: to accelerate the completion of FeedTools by
dramatically increasing the test case coverage, and secondarily to
spark a discussion as to what a common API should look like, if for
no other reason than to enable FeedTools to leverage the excellent
documentation provided with Feed Parser.
Please note that Universal Feed Parser 4.0 (currently pre-alpha, available via CVS only) fully supports Atom 1.0 feeds. The test cases are checked into CVS, but the corresponding changes to the parsed data structure are not documented anywhere yet. Here are the changes so far, subject to my whims as I continue working on it:
New entry.published and entry.published_parsed for atom10:published, atom03:issued, and dcterms:issued (entry.issued still works for backward compatibility)
New entry.updated and entry.updated_parsed for atom10:updated, atom03:modified, dcterms:modified, dc:date, and pubDate (entry.modified still works for backward compatibility)
New entry.tags is an array of dict containing {'term': term, 'scheme': scheme, 'label': label}. It is populated from atom10:category, RSS category, dc:subject, itunes:keywords (space-separated), and probably some other elements that I’ve forgotten.
entry.content and other content fields no longer contain the “mode” attribute, which was redundant in any case because the content had already been unescaped/de-base64’d during parsing.
“url” is now “href” everywhere, for example feed.generator.href or entry.enclosures[0].href. “url” still works for backward compatibility. I actually have a function somewhere called “itsAnHrefDamnIt” which maps uri, url, and other attribute names to href.
entry.source is now a dict of a bunch of stuff, populated from atom10:source (and eventually the wacky origLink thingie that FeedBurner uses). Previously it was a string populated from the RSS source element, which was never useful and is now unsupported. This is the only backwardly incompatible change so far.
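The backward-compatible renames above ("url" to "href", "issued" to "published", "modified" to "updated") can be pictured as a dictionary that resolves legacy names before every lookup. The following is only a minimal sketch under that assumption: the AliasDict class and its ALIASES table are hypothetical names made up for illustration, not code taken from Universal Feed Parser.

```python
class AliasDict(dict):
    """Hypothetical dict that maps legacy key names onto their new names."""

    # Legacy name -> current name, limited to the renames listed above.
    ALIASES = {
        "url": "href",
        "issued": "published",
        "modified": "updated",
    }

    def _resolve(self, key):
        # Unknown keys are returned unchanged.
        return self.ALIASES.get(key, key)

    def __getitem__(self, key):
        return super().__getitem__(self._resolve(key))

    def __setitem__(self, key, value):
        super().__setitem__(self._resolve(key), value)

    def __contains__(self, key):
        return super().__contains__(self._resolve(key))

    def get(self, key, default=None):
        return super().get(self._resolve(key), default)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so that
        # entry.published works as well as entry["published"].
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Values are assigned item by item so that __setitem__ applies the aliases.
enclosure = AliasDict()
enclosure["url"] = "http://example.org/audio.mp3"  # stored under "href"

entry = AliasDict()
entry["updated"] = "2005-08-01T12:00:00Z"
```

With this in place, enclosure["href"], enclosure["url"], and enclosure.href all return the same value, and entry["modified"] returns what was stored as "updated". Only the new name is actually stored.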
I welcome feedback on my choice of element mappings, except dates, which I’m sick to death of discussing.
I also look forward to seeing a Ruby feed parser that supports EBCDIC.
Nice. When I first started writing FeedTools, I was still a ruby nuby, and initially, I had wanted to do something like this. But back then, I was still under the false impression that it would be an either/or proposition: either I used Mark’s tests or I had the flexibility of using my own. And since the scope of what I was trying to do with FeedTools was a bit bigger than UFP, I went with my own, and copied in a lot of Mark’s tests. In retrospect, it seems particularly silly that I didn’t even bother to investigate the concept. But anyways.
That’s also the reason that Mark’s RSS tests tend to pass. Those are the ones I had the time to copy over.
The other issue was with the bozo bit. In several of the FeedTools versions, it was present, but I ended up removing it because my parser basically shoots itself in the foot in terms of being able to determine if a feed is valid or not. And besides, I’m not really sure validation ought to be a concern of the parser. It seems to me that dedicated validators do a better job, and most of the time, it’d be useless overhead for a liberal parser. Obviously, the fact that it’s missing causes problems for Mark’s tests.
I saw the solution you came up with for dealing with differences between UFP and FeedTools, and for transforming from Python to Ruby, but I’m not 100% certain that I like it. More of a stylistic thing than anything else, but... honestly, I can’t help but wonder if trying to share the exact same unit test xml files between UFP and FeedTools might be a mistake. Perhaps it might be a better idea to just transform the comments in the xml files and automate that process?
Subsetting or supersetting UFP’s requirements are both valid things to do. But there is a lot of valuable experience behind each and every one of those test cases. Something that should not easily be dismissed.
For that reason, I do believe that there is value in sharing a large subset of the test suite between the various tools.
Perhaps we can come up with a more declarative and language-independent grammar for expressing the bulk of the tests, possibly with a fallback syntax for the small subset of tests which require more specialized handling.
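One way to picture such a declarative grammar is a table of paths into the parsed result, each paired with an expected value, which any language could evaluate without executing Python or Ruby expressions. This is a sketch only: the dotted-path syntax and the names lookup, run_expectations, and MISSING are assumptions made for illustration, not an existing format used by Feed Parser or FeedTools.

```python
MISSING = object()  # sentinel for "the path does not exist in the result"


def lookup(data, path):
    """Follow a dotted path such as 'entries.0.title' through dicts and lists."""
    current = data
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]
        else:
            current = current[part]
    return current


def run_expectations(parsed, expectations):
    """Return (path, expected, actual) for every expectation that fails."""
    failures = []
    for path, expected in expectations.items():
        try:
            actual = lookup(parsed, path)
        except (KeyError, IndexError, ValueError, TypeError):
            actual = MISSING
        if actual is MISSING or actual != expected:
            failures.append((path, expected, actual))
    return failures


# A parsed result and the expectations that one test file might carry.
parsed = {"feed": {"title": "Example"}, "entries": [{"title": "First"}]}
expectations = {"feed.title": "Example", "entries.0.title": "First"}
failures = run_expectations(parsed, expectations)
```

Tests that need more than equality checks would still require the fallback syntax mentioned above.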
Mozilla Thunderbird’s feed-parser would also like to share a test suite. I’ve got the parsing code running with jsDriver.pl, the unit test script used to test Mozilla’s JS engine. This means I can run large numbers of tests from the command line, using XPCShell.
I asked Mark about this way back when, but the thing was so buggy I got caught fixing bugs without writing tests (I know, I know... lesson learned).
Don’t get too discouraged. As an example of one of the classes of errors that I found (and ignored for the moment): in FeedTools, author is a hash containing email, url, name and the like. In UFP, author is a string. However, author_detail is provided, which contains the more granular data.
Determining how to reconcile (or failing that, map) these two approaches is key to further progress.
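As a rough illustration of the mapping option, a FeedTools-style author hash could be converted into a UFP-style pair of author string and author_detail dict. This is a sketch under stated assumptions: the function name to_ufp_author is hypothetical, the "name (email)" format of the combined string is an assumption, and "url" is mapped to "href" following the renaming described earlier.

```python
def to_ufp_author(feedtools_author):
    """Map a FeedTools-style author hash to (author, author_detail)."""
    # Copy the granular fields, renaming "url" to "href".
    detail = {}
    for source_key, target_key in (("name", "name"),
                                   ("email", "email"),
                                   ("url", "href")):
        value = feedtools_author.get(source_key)
        if value:
            detail[target_key] = value

    # Assumed format for the flat string: "name (email)" when both exist,
    # otherwise whichever one is available.
    name = detail.get("name")
    email = detail.get("email")
    if name and email:
        author = f"{name} ({email})"
    else:
        author = name or email or ""
    return author, detail


author, author_detail = to_ufp_author({
    "name": "Jane Doe",
    "email": "jane@example.org",
    "url": "http://example.org/jane",
})
```

Going the other direction (from a flat string back to a hash) is lossy unless author_detail is available, which is one reason the choice between the two shapes matters for a shared test suite.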
And besides, I’m not really sure validation ought to be a concern of the parser.
I assume by “validation” you mean “well-formedness” — UFP does no DTD or schema validation of any format.
On the subject of well-formedness, I was eventually convinced by interested parties (hi Tim!) that well-formedness is a concern of the parser. Some applications built on UFP may wish to reject feeds that are not well-formed. To support such masochistic party-poopers, the bozo bit was born and has been meticulously maintained ever since.
However, in the process of adding support for the bozo bit in UFP 3.0, I inadvertently stumbled into a rat’s nest of character encoding issues centering around RFC 3023. Briefly: in order to maximize the irony inherent in a feed parser that is simultaneously the world’s most ultraliberal and the world’s most draconian, UFP supports RFC 3023, which specifies the precedence rules for determining the character encoding of an XML document served over HTTP (which would be, like, all of them, at least as far as syndicated feeds are concerned). Some people (hi Tim!) feel that an XML document is a self-contained bag of bits that specifies its own character encoding, regardless of the enclosing transport. To support such delusional thinking — which, by the way, has always been completely and utterly unsupported by the XML specification that said people co-authored, and is now in fact explicitly contradicted by the latest version of said specification — I have classified three of the exceptions captured in the bozo_exception field — namely CharacterEncodingOverride, CharacterEncodingUnknown, and NonXMLContentType — as subclasses of an abstract exception class named, appropriately enough, ThingsNobodyCaresAboutButMe.
These were my design goals. Yours may be different.
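For readers unfamiliar with RFC 3023, its precedence rules can be sketched roughly as follows. This is a simplified illustration, not UFP's implementation: the function name rfc3023_encoding is hypothetical, byte order mark detection is omitted, and the fallback used for non-XML media types is an assumption.

```python
def rfc3023_encoding(content_type, xml_declared_encoding=None):
    """Return (encoding, is_xml_media_type) for an HTTP-served document."""
    media_type, _, params = (content_type or "").partition(";")
    media_type = media_type.strip().lower()

    # Pull the charset parameter, if any, out of the Content-Type header.
    charset = None
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            charset = value.strip().strip("\"'").lower()

    application_xml = media_type in (
        "application/xml",
        "application/xml-dtd",
        "application/xml-external-parsed-entity",
    ) or (media_type.startswith("application/") and media_type.endswith("+xml"))

    text_xml = media_type in (
        "text/xml",
        "text/xml-external-parsed-entity",
    ) or (media_type.startswith("text/") and media_type.endswith("+xml"))

    if application_xml:
        # HTTP charset wins; otherwise the XML declaration; otherwise UTF-8.
        return (charset or xml_declared_encoding or "utf-8"), True
    if text_xml:
        # HTTP charset wins; otherwise us-ascii. The XML declaration is ignored.
        return (charset or "us-ascii"), True
    # Not an XML media type. Falling back to the document's own declaration
    # here is an assumption of this sketch.
    return (xml_declared_encoding or "utf-8"), False


atom_result = rfc3023_encoding("application/atom+xml; charset=ISO-8859-1", "utf-8")
text_result = rfc3023_encoding("text/xml", "utf-8")
```

The text/xml case is the surprising one: with no charset parameter the document is treated as us-ascii even when its XML declaration says otherwise.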
I assume by “validation” you mean “well-formedness” — UFP does no DTD or schema validation of any format.
Yeah, that’s what I meant.
Others may simply want to display a visual indicator, like iCab does.
True, though in general I’d contend that nothing user-facing should be displaying stuff like that, unless the intended audience is developers. And if you need that functionality, I’d suggest using a validator designed specifically for the purpose.
(We’ll pretend that there actually is a good Ruby feed validator.)
Inspired by Sam Ruby’s work on applying the Universal Feed Parser tests to the Ruby FeedTools, I’ve spent a little time this afternoon working on testing XML_Feed_Parser with that same test suite. There’s a lot of work to do!...
I managed to recover my laptop battery within two weeks of good charging practices and despite its old age. It’s an ASUS, just in case you were interested, and it’s been serving me well since the last quarter of 2001 when I bought it. As...
It’s been a long-standing todo to port Mark’s FeedParser tests to work against Magpie, possibly with an intermediate representation to allow cross-language testing. (Has any work been done on capturing unit tests/acceptance tests in XML?) Sam’s...
Anyways, I’m finally implementing some of the encoding stuffs, but ebcdic seems to be problematic:
converter = Iconv.new('ebcdic-cp-be', 'utf-8')
Errno::EINVAL: Invalid argument - iconv("ebcdic-cp-be", "utf-8")
from (irb):15:in `initialize'
from (irb):15:in `new'
from (irb):15
from :0
iconv -l | grep "EBCDIC"
=> shows nothing
This is on OS X... any idea why ebcdic wouldn’t show up and/or how I might rectify that situation?
Pirate Testing (Because Only Ninjas Write Unit Tests)
I’ve got a new favorite development technique, “pirate testing”. I’ve used it on 3 recent projects, and it rocks. And while Sam might have meant it literally, I’ve found it perfectly describes the practice of shanghaiing another tool’s test suite...
Since people were waving tasks around several months ago and I finally got tired of Atom feeds showing up incorrectly in Gregarius, I decided to port some of the Feedparser tests to Magpie. I created a rudimentary ajax-ified unit testing harness for...
This post is huge but I have not the time to make it smaller. I’m so very tired. A Quick Introduction: rFeedParser is an RSS/Atom feed parser. It is a translation of Mark Pilgrim’s feedparser from Python to Ruby. It behaves almost exactly...