It’s just data

Yet Another Planet Refactoring

A little over a month ago, I outlined how I would like to see the feed parser reorganized.  I’ve now put a little meat on the bones, in the form of running code.  Not just for the feed parser, but also for Planet.  I also did it all in Ruby, so I named this little experiment Mars.  Warning: this version is 0.0.1.  It just barely runs end-to-end.  Feed it real data, and it will choke on some of it.  But it can now produce partial results.

Inventory:

config
To keep things compatible, I ported the parsing logic from Python’s ConfigParser to Ruby.  Once parsed, the results are accessed as a Hash.  Eventually, I’ll provide Planet-specific logic, like defaults, in this module.
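As a toy illustration of the approach (not the actual module, which ports ConfigParser’s grammar more faithfully), INI-style text can be reduced to a nested Hash in a few lines:

```ruby
# Toy INI-to-Hash parser: sections become keys of the outer Hash,
# name=value pairs become keys of the inner one.  Illustrative only.
def parse_ini(text)
  config = {}
  section = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty? or line.start_with?('#', ';')
    if line =~ /^\[(.*)\]$/
      section = config[$1] = {}
    elsif section and line =~ /^([^=:]+)[=:](.*)$/
      section[$1.strip] = $2.strip
    end
  end
  config
end

planet = parse_ini(<<INI)
[Planet]
name = Planet Intertwingly
[http://example.com/feed.atom]
name = Example Feed
INI
```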
fido
Fido fetches (get it?) feeds.  It caches, follows redirects, times out, handles HTTP status codes, and deals with compression.  It also distributes work across multiple threads.
xmlparser
This module will use one of four XML parsers, and return the result as a REXML document.  If installed, it will use the fast and standards-compliant expat or libxml2 parsers.  If neither is installed, it will fall back to the slower and less compliant REXML (to this day, it still parses <a>&a</a> and <a a='<'/> without error).  All in all, REXML isn’t too bad... as long as you don’t depend on it for serialization or deserialization or XPath, or expect quick turnaround on bug fixes or responses on their mailing list.  In the event the chosen parser fails to parse the document, the HTML5lib liberal XML parser will be used, and a bozo flag will be set on the document itself.
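In sketch form, the fallback might look like this; liberal_parse below is a toy stand-in for html5lib’s forgiving parser, and the bozo flag is attached as a singleton method:

```ruby
require 'rexml/document'

# Toy stand-in for html5lib's liberal parser: wrap whatever we were
# given as text under a placeholder root.  Illustrative only.
def liberal_parse(source)
  doc = REXML::Document.new('<parsererror/>')
  doc.root.text = source
  doc
end

# Try a strict parse first; on failure, reparse liberally and set a
# bozo flag on the document itself.
def parse_feed(source)
  doc = REXML::Document.new(source)
  def doc.bozo; false; end
  doc
rescue REXML::ParseException
  doc = liberal_parse(source)
  def doc.bozo; true; end
  doc
end

good = parse_feed('<a>ok</a>')
bad  = parse_feed('<a><b></a>')   # missing end tag for b
```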
Transmogrify
With RSS, there are often several ways to express the same concept (shades of TMTOWTDI?).  Atom aspires to Python’s philosophy of “There should be one—and preferably only one—obvious way to do it.”  This module is clearly opinionated software in that it will transmogrify feeds which use less obvious constructs into more obvious ones.
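A toy example of the kind of rewrite involved, renaming RSS’s description to Atom’s summary (the actual module applies many such mappings; this element choice is just for illustration):

```ruby
require 'rexml/document'

# Rename every <description> element to <summary>.  A toy version of
# one transmogrification rule; the real module applies many.
def transmogrify(doc)
  doc.elements.each('//description') { |e| e.name = 'summary' }
  doc
end

doc = transmogrify(
  REXML::Document.new('<item><description>hi</description></item>'))
```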
sift
Sift will filter out impurities and break down HTML into elements that can be iterated over.
forgiving_uri
This is from Bob Aman.  I need to look into Addressable (also from Bob Aman) to see if it is a better alternative.
spider
Spider orchestrates the retrieval and sifting of feeds, breaks the results into a set of entries, adds in source information, and caches the results.
splice
Splice will select the latest entries and produce a feed, and then process this information using a user supplied list of templates.  At the moment, the only templating language supported is XSLT.
style
If libxslt is installed, it will be used, otherwise an attempt is made to shell out to xsltproc.
harvest
Not yet integrated into Mars proper, but used to drive the testing (and therefore the development) to date, this function dynamically constructs dictionaries from an Atom document.  This will undoubtedly be useful for other templating languages.  As everything is constructed dynamically, multiple date formats, multiple serialization formats (think: dropping superfluous quotes on attributes and pesky things like explicit paragraph end tags for HTML5 purists), and multiple aliases for any given element are no problem.
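The core idea can be sketched as a recursive walk that turns elements into Hash keys (the real harvest.rb does considerably more, including the date formats and aliases mentioned above):

```ruby
require 'rexml/document'

# Recursively turn an element tree into nested Hashes so templates can
# address values by name.  A sketch; not the actual harvest.rb.
def harvest(element)
  hash = {}
  element.elements.each do |child|
    hash[child.name] = child.has_elements? ? harvest(child) : child.text
  end
  hash
end

entry = REXML::Document.new(
  '<entry><title>Mars 0.0.1</title><author><name>Sam</name></author></entry>'
).root
data = harvest(entry)
```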

Also provided

planet
A small main program to drive the execution.
reconstitute
A demo program which will enable you to see what the parsed, transmogrified, and sifted version of any given feed looks like.
test/feedparser
Runs a small portion (currently wellformed/(atom10|rss)/*.xml) of the feedparser test suite.  Check the comments to see what is not yet supported (mostly elements like cloud and textInput).

All in all, I’m pleased with how compact this code is.  If anybody wants to join in on the fun, it is available as a bzr repository and there are plenty of test cases ready to be ported.


The link to fido is wrong.

Posted by tim at

Fixed.  Thanks!

Posted by Sam Ruby at

If you’re looking at rewriting the Planet code, I have a request for you to consider. I suspect this won’t be feasible, but it would be really nice if the aggregated feed from the Planet could have the absolute minimum of filtering and transmogrifying applied to it.  Two reasons:

1. When you filter the feed, you butcher a lot of the content. Stuff which you may not want in your HTML view (for security reasons or whatever), is often very useful for people viewing the feed from a desktop feed reader (think embedded videos, and microformats).

2. I’ve recently started experimenting with duplicate detection across feeds (I’ve reconsidered my previous position on the subject). The problem I’m finding is that the feed from Planet Intertwingly often contains different content to the source feeds, and I flag any such changes in an effort to identify hack attempts. This becomes somewhat counter-productive (and annoying) when items are being flagged almost all the time.

I can go into more detail if this seems like something that could be addressed. If not, I understand.

Posted by James Holderness at

absolute minimum of filtering and transmogrifying

A few questions:

A suggestion that if it pans out could possibly even be applied today: if there is a <source> element present, treat the entry as being of a lower “fidelity” than the original.

Posted by Sam Ruby at

Bear in mind that I’m doing my own normalizing/transmogrifying. Encoding, relative URIs, even changes in feed format shouldn’t be much of an issue. And obviously a lot of that sort of thing has to be done for you to produce a unified feed. I’m more worried about changes to the actual text content in the feed. I’m not even comparing markup, so you would think this should work fairly well, but it doesn’t.

I haven’t looked at it in much detail yet (didn’t think there was much point if you can’t do anything about it), but one example I saw was when you converted an html atom:summary to plain text. Stripping the markup wasn’t a problem - it was the whitespace that became an issue. In HTML, whitespace is generally not significant, but in plain text it is. So your conversion appeared to me as a significant change in content.

I could probably deal with that particular issue myself with a more relaxed string comparison, but the underlying problem remains - if you’re changing the message content, then something is bound to break sooner or later.

if there is a <source> element present, treat the entry as being of a lower “fidelity” than the original.

I’m already doing that. However, consider this scenario. The user refreshes a Planet feed and gets a bunch of new messages which he immediately reads. He’s also subscribed to some of the feeds individually, and when those feeds refresh he gets a new set of messages all of which he has already read. However, since the content now appears to be different, the app is obliged to inform the user (otherwise he won’t know that the messages he previously read may have been fake). That’s what I’m trying to avoid.

Also, a source element doesn’t solve my first problem (in the case of feeds that I’m only seeing via the Planet). I’d like to be able to view an embedded video in the Planet feed without having to subscribe to the source feed separately.

Posted by James Holderness at

When you filter the feed, you butcher a lot of the content.

Perhaps we could dispense with the editorializing?

Bear in mind that I’m doing my own normalizing/transmogrifying.

I see.  When we do it, it’s butchering.  When you do it, it’s cute and fluffy like something out of a Calvin and Hobbes cartoon.

Posted by Mark at

rawr!

Posted by Robert Sayre at

Perhaps we could dispense with the editorializing?

Yeah, that’s what I’m asking. Although I probably wouldn’t have said “editorializing”. Maybe “butchering”?

When we do it, it’s butchering. When you do it, it’s cute and fluffy like something out of a Calvin and Hobbes cartoon.

No, when I do it, it’s still butchering. But I’m not republishing the butchered data with an atom:id that implies it’s the same thing as the original.

rawr!

Tigers are great.

Posted by James Holderness at

relative URIs ... shouldn’t be much of a issue. ... I’m more worried about changes to the actual text content in the feed.

What about relative URIs in the actual text content?

In HTML, whitespace is generally not significant, but in plain text it is.

Only if you want it to be.  RFC 4287 § 3.1.1.1

Stripping the markup wasn’t a problem

Mars doesn’t intentionally strip valid markup.  But the code is new and bound to be full of bugs.

But I’m not republishing the butchered data with an atom:id that implies it’s the same thing as the original.

I also note that you haven’t answered my original questions.  There will be changes to the feed.  For you to be able to do any duplicate detection, the original id will need to be propagated — either as the id (per the spec) or in an extension (as Google Reader does).

The codebase is young and easily refactored.  If you have specific proposals, I’ll be glad to work through them with you.

Posted by Sam Ruby at

What about relative URIs in the actual text content?

By text content, I meant text and only text (not markup). You’re assumedly not making changes to bits of plain text that look like they might be relative URIs.

In HTML, whitespace is generally not significant, but in plain text it is.

Only if you want it to be.  RFC 4287 § 3.1.1.1

I realise that. I don’t have a problem with you choosing to ignore whitespace when you display it. However, you’re republishing the content with changes that make it impossible for me to make that choice.

Mars doesn’t intentionally strip valid markup.

I was referring to what I saw in Planet Intertwingly (which is assumedly still Venus), namely the conversion of an html atom:summary into a plain text atom:summary. Obviously that would result in markup being stripped if there was any, but I can’t find any examples now of summaries containing markup (that couldn’t at least be converted) so maybe that never happens. Either way, I didn’t have a problem with that.

I also note that you haven’t answered my original questions.

Looking back I think I answered all your questions, except “What if the original feed isn’t well formed?”. The answer is the same for all of them. It doesn’t matter. I understand that you need to make certain syntactic changes to the source data in order to produce a unified, valid atom feed as output. That’s not a problem. It’s semantic changes that bother me.

Removing a video from someone’s message is not the same as converting their encoding from UTF-16 to UTF-8. When it comes to correcting well-formedness errors or resolving relative URIs in RSS, the issue becomes less clear cut because the semantics weren’t clear to begin with, but those are edge cases. And with any luck, we’ll probably agree on those semantic interpretations anyway.

There will be changes to the feed.

Understandable. As I said above: syntactic ok; semantic not so ok.

If you have specific proposals, I’ll be glad to work through them with you.

Well here’s one idea. Can you not pull out your filtering/whitelisting code into a separate module? Then rather than applying it to the feed content before writing to the cache, let the cache keep the unfiltered content, and only apply the whitelist module when generating the HTML for your web view. That way the feed can still be generated with the unfiltered content.

Posted by James Holderness at

Can you not pull out your filtering/whitelisting code into a separate module?

The purpose of this particular refactoring is to enable exactly that sort of experimentation.  Part of the experimentation will be to enable publishers of planets to set policies.  I, for example, don’t like the thought of my planet being used as a vector to distribute a script attack.  I also see value in videos, so if I can find a policy I am comfortable with, such will start showing up in the html output as well as the feed.

Posted by Sam Ruby at

When it comes to correcting well-formedness errors ... in RSS, the issue becomes less clear cut

There is no spec governing the correction of non-well-formed Atom feeds either.

not the same as converting their encoding from UTF-16 to UTF-8

Unicode Normalization Form C or Unicode Normalization Form KC?

Posted by Mark at

Nice! Is this meant to continue as an experiment or does this effort mean you will stop work on Venus and UFP?

Posted by Ralph Meijer at

Sam Ruby:

Part of the experimentation will be to enable publishers of planets to set policies.  I, for example, don’t like the thought of my planet being used as a vector to distribute a script attack.

I figured you’d say that. :)

Now consider someone of religious persuasion that doesn’t want their planet feed being used to distribute foul, blasphemous language. As a result, they apply a filter to all feed content that automatically removes any swearing or references to Richard Dawkins. At what point do you consider the content in such a feed to be a derivative work?

And if it is a derivative work, should the atom:ids not be different from the ids identifying the original work? I’m not really convinced either way.

I suspect there are legal implications too, but that’s not of much interest to me.

Mark:

There is no ... well-formed Atom ...

Amazing how you can twist what someone says when you quote selectively.

not the same as converting their encoding from UTF-16 to UTF-8

Unicode Normalization Form C or Unicode Normalization Form KC?

I wasn’t aware that Unicode Normalization was required when converting from UTF-16 to UTF-8. Oh wait, it isn’t. I choose “none of the above”.

Posted by James Holderness at

Regarding Addressable, yes, it is a better alternative, and it’s definitely what you want to be using.  IIRC, there are some bugs with the old code you’re using that the now-properly-refactored Addressable code deals with, mostly related to URI normalization and percent escaping.  Plus the Addressable code is far better specified, with 100% C0 code coverage and a nearly 3:1 spec-to-code ratio.

Posted by Bob Aman at

I flag any such changes in an effort to identify hack attempts.

A modest proposal:  When attempting to identify “hack” attempts, assuming you already have both pieces of content, don’t do so with a simple string-based comparison.  Tokenize the content first, resolve any URIs, and then compare the two token sets.  Intentionally make the tokenization step lossy, but avoid any lossiness that might be obviously exploitable.
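That proposal might be sketched as follows; the tokenization rules here (strip tags, fold case, keep word characters) are deliberately simplistic placeholders, and a real implementation would choose its lossiness carefully:

```ruby
# Lossy tokenization: markup, case, whitespace, and punctuation are all
# discarded before comparison, so cosmetic republishing changes don't
# register as tampering.  Rules are illustrative placeholders.
def tokenize(html)
  html.gsub(/<[^>]*>/, ' ').downcase.scan(/[[:alnum:]]+/)
end

def same_content?(a, b)
  tokenize(a) == tokenize(b)
end

original    = "<p>Hello,\n  World!</p>"
republished = 'Hello, world!'
```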

And if you don’t have both pieces of content, maybe you should?

Posted by Bob Aman at

Is this meant to continue as an experiment or does this effort mean you will stop work on Venus and UFP?

I honestly don’t know.  I don’t have any current plans to stop work on Venus and UFP.  Meanwhile, I am continuing to commit function to Mars, and have set up a temporary, parallel planet running the absolute latest.

The biggest factor to me is the size of the development community that each code base attracts.

Posted by Sam Ruby at

When it comes to correcting well-formedness errors or resolving relative URIs in RSS, the issue becomes less clear cut because the semantics weren’t clear to begin with, but those are edge cases.

Happy?  It doesn’t change the fact that you’re on your own with non-well-formed feeds of any format.  Unlike HTML5, the Atom Working Group chose not to deal with the issue of error correction.  But of course you knew that already.

I wasn’t aware that Unicode Normalization was required when converting from UTF-16 to UTF-8.

This thread began because you were complaining about the problems you were having trying to compare strings for equality.  I naively assumed that you gave a shit about whether they were, you know, equal.  Just out of curiosity, how DO you plan to compare strings, once Sam is done bending the world to your whim?

Posted by Mark at

This thread began because you were complaining about the problems you were having trying to compare strings for equality.

The things that get you folks fired up.

But Mark is right, if you actually want to make a big deal out of “hack” attempts, normalization has to be done.  If it’s even a possibility that a legitimate republisher of content has normalized the content (which it is), then you have to take that into account.  So one of the transformations you’d make during tokenization would probably be to convert to UTF-8 and normalize.  I’d probably use the same normalization steps done for IDN, because lossiness is desirable here.
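To illustrate the point with a later-era Ruby (String#unicode_normalize arrived well after this discussion), two strings that render identically can compare unequal until normalized:

```ruby
# "café" written two ways: precomposed é versus e plus a combining
# accent.  Byte comparison sees a difference; NFC normalization
# removes it.
precomposed = "caf\u00E9"
decomposed  = "cafe\u0301"

byte_equal = (precomposed == decomposed)
nfc_equal  = (precomposed.unicode_normalize(:nfc) ==
              decomposed.unicode_normalize(:nfc))
```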

That said, I think this is an incredible waste of time.  If “hack” attempts actually matter to you, I’d probably go with a more low-tech solution.  Again, assuming you already have both sets of content, just create a “tabbed” interface that lets you select either content source for display.  This should be trivial for both web and desktop applications.  Diff as necessary.

Posted by Bob Aman at

Where is the bzr repository?  bzr branch [link] doesn’t seem to work (sorry, I’m new to bzr).

Also, rfeedparser seems to be a pretty good project.  Why did you choose to roll your own?  Just curious what rfeedparser’s shortcomings were.

Posted by Scott Bronson at

Lovely... now my email address is harvestable.  Good thing I tagged it.

Sam, how about putting a warning on the E-mail box?  Something like, “WARNING: if you do not supply a URI, I will display your address publicly!”

(Sorry, I would have sent this privately if I could have easily found your contact info...)

Posted by Scott Bronson at

Thank you all for your suggestions, but you’re trying to solve problems that I don’t have. And for now I’ve decided to pull this feature anyway.

My initial request still stands: it would be nice if the feed from Planet Intertwingly returned the source content unfiltered. However, I can accept that that’s not likely to happen. I guess I always have the option of subscribing to the individual feeds myself.

Posted by James Holderness at

bzr branch http://intertwingly.net/code/mars/ doesn’t seem to work

Try bzr get.  More info here.

rfeedparser seems to be a pretty good project.  Why did you choose to roll your own

It shares a design approach with feedparser.  Namely that it provides the sanitization, with no access to the original.  It “butchers” extensions.  It converts everything to a Hash, which (with Venus’s design) needs to be converted back to an XML document, and then (if you use HTMLTmpl) back to a Hash.  Both use sgmllib (or equivalent) instead of html5lib.

I honestly don’t know how far this experiment will take me, but so far it looks promising.

my email address is harvestable

Removed.  Note: that field is optional.

Posted by Sam Ruby at

[*aside] Has its own weather system

At this point, I’ve got at least a half-dozen or so separate blogs in various stages of disuse and serial neglect. I expect that this trend will continue into the foreseeable future, with me hopping on to different publishing platforms and suchlike...

Excerpt from decafbad recaffeinated at

Hi Sam.  So far I’ve been enjoying playing with your code.  Two questions: what license are you releasing it under?  Some variation of the Python license, like the old Planet code?

Also, in config.rb:

  next if line.split(nil,2).first.downcase and 'rR'.include?(line[0])

What is the intent of this line?  afaict, the first expression always returns true and the second one will return true if the line begins with ‘r’ or ‘R’.  What do you have against config options that begin with “r” and don’t have any leading whitespace?  :)

Thanks!

Posted by Scott Bronson at

what license are you releasing it under?

MIT

Also, in config.rb:

next if line.split(nil,2).first.downcase and 'rR'.include?(line[0])

That’s a bug.  You can find the original Python in ConfigParser.py:

if line.split(None, 1)[0].lower() == 'rem' and line[0] in "rR":

If you haven’t done so recently, bzr pull the latest, as there have been a number of fixes.  If you are in a position to do so, please publish any changes you make in a bzr repository so that I can pull them from you.

Posted by Sam Ruby at

I’ve settled on the following as a fix to the config problem cited above:

next if line =~ /^rem(\s|$)/i
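A quick check that the regex does what the original Python intended, skipping REM comment lines and nothing else (rem_line? is a hypothetical helper name, for illustration):

```ruby
# True only for lines that are "rem" comments in ConfigParser's sense:
# "rem" (any case) followed by whitespace or end of line.
def rem_line?(line)
  !(line =~ /^rem(\s|$)/i).nil?
end
```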
Posted by Sam Ruby at

suite of tools from Sam Ruby including a RequestAgent-alike and some feed parsing stuff...

Excerpt from del.icio.us/atduskgreg at

The biggest factor to me is the size of the development community that each code base attracts.

Where would said community gather? I’ve got some ideas for contributions--is there a mailing list, or are the comments on this entry the best place to discuss it?

Posted by Phil at

Comments on this entry (perhaps even just a pointer to proposals defined elsewhere) would be fine.  If a mailing list is preferred, we could use this one.  Ultimately, it may make sense to explore merging with this project.

Posted by Sam Ruby at

I coded up a patch that adds support for importing subscription lists from OPML files with REXML and provides sane default config values: [link]. (This is my first bzr use, so if there’s a better way to submit contributions I’d love to hear it.)

I’m a little curious as to why you recommend running ruby with the -rubygems switch every time rather than writing out a simple “require 'rubygems'” once in the actual code. Seems like one more thing to forget.

Would be great if it could use OPML or YAML by default as the INI format feels quite foreign to Ruby. Right now my patch just uses OPML if the config filename contains “opml”, otherwise it runs it through the backwards-compatible parser. Subclassing ConfigParser from PythonConfigParser seems a little odd though.

Posted by Phil at

This is my first bzr use, so if there’s a better way to submit contributions I’d love to hear it.

Simply scp/rsync/ftp your entire directory structure up to technomancy.us, and let me know the URI I can use to get to it.  That’s it.  Nothing to install.  I can then bzr pull from you, and vice versa.  Or, for more complicated situations bzr merge followed by bzr commit.

Would be great if it could use OPML or YAML by default as the INI format feels quite foreign to Ruby.

Supporting YAML would be cool.  And, obviously, quite easy.

Right now my patch just uses OPML if the config filename contains “opml”, otherwise it runs it through the backwards-compatible parser.

We could change it to

begin
  send "read_#{filename.split('.').last}", filename
rescue NoMethodError
  STDERR.puts "Unsupported file format"
  exit
end

Subclassing ConfigParser from PythonConfigParser seems a little odd though.

Feel free to rename the class.

Posted by Sam Ruby at

Sam,

Do you have any opinion about the best order for porting the test cases? I have some spare cycles to work on them. Finish the feedparser cases first? Input through to output? Middle out?  Any guidance you have would be appreciated.

Posted by Jim Holt at

Do you have any opinion about the best order for porting the test cases?

At the moment, IMHO, the most important piece of missing functionality is support for a second templating system.  XSLT is sufficient for my needs and for proof of concept, but I imagine that most would prefer another templating system.  I’d like for one of the templating systems supported to be htmltmpl compatible, but that doesn’t have to be the next one.

Posted by Sam Ruby at

Cool; bzr seems easier to publish than git. ERb is the obvious choice for a templating system coming from the perspective of a Rubyist. It’s in the standard library, everyone who’s used Rails knows it, and it’s pretty trivial to implement support for. I’ll see if I can give that a go.
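A minimal sketch of what ERb support might look like; result_with_hash needs a modern Ruby (2.5+), and the render name is illustrative rather than any eventual Mars API:

```ruby
require 'erb'

# Render an ERb template against a Hash of harvested values.
# Sketch only; names and calling convention are assumptions.
def render(template, data)
  ERB.new(template).result_with_hash(data)
end

html = render('<h1><%= title %></h1>', title: 'Planet Mars')
```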

Posted by Phil Hagelberg at

bzr seems easier to publish than git

The procedure is exactly the same in git; you need to do more only if you want more efficient synch than via HTTP. And I’ve been told that this works for Mercurial also.

Posted by Aristotle Pagaltzis at

Hi Sam.  You say in the readme, “REXML version 3.1.6 won’t do.”  Well, apparently neither will 3.1.7.1:

  2) Failure:
test_102(XmlParserTestCase) [./test/xmlparser.rb:10]:
<nil> is not true.

svn, as you mention, works just fine.

Posted by Scott Bronson at

Hi Sam,
I have a dot release (0.4) available for processing haml templates in Mars. Haml is described at [link]  The bazaar repository is available at [link]. While the code still needs work, the template (see index.html.haml) produces a very good facsimile of the Mars version of Planet Intertwingly.

The processing engine relies on harvest.rb for loading the template environment. Template variables are pure Ruby and have the same names as in Planet Venus (see docs/templates.html).

I ported the tmpl filter test cases from Venus. Haml passes 40 of the 51 cases. Many of the 11 that fail seem like they are related to encoding issues in harvest.

So...  1) Are the tmpl filter test cases appropriate given the difference between harvest and feedparser, 2) if so, any guidance about what needs changing--if anything--in harvest, and 3) because htmltmpl is in Python, how do you define htmltmpl compatibility?

Finally, thanks for sharing the fun.

Posted by Jim Holt at

Hi Sam.  Here’s a bug fix: [github]

Currently, if a feed is ill-formed, and you’re using libxml, it will be skipped rather than parsed with html5.  This patch fixes it:

diff --git a/planet/xmlparser.rb b/planet/xmlparser.rb
index b7d3093..729e287 100644
--- a/planet/xmlparser.rb
+++ b/planet/xmlparser.rb
@@ -31,7 +31,7 @@ module Planet
           doc = REXML::Document.new source
         end
         bozo = false
-      rescue
+      rescue Exception => e
         # If everything is being bozo'd, enable this to see why.
         # print "PARSE ERROR: #{$!}\n  #{$!.backtrace.join("\n  ")}\n"


Posted by Scott Bronson at

Hi again Sam.  I pushed another fix for run-on formatting.  This github page should show all the changes I hope you’ll accept: [link]

Posted by Scott Bronson at

Good catch, though I’d prefer something more along the lines of

if node.elements.size == 0 && node.text == nil
  node.text = '' unless HTML5::VOID_ELEMENTS.include? node.name
end

It is not so much the retaining of the element that matters to me, but letting the HTML5 packages maintain the list of element names.

And more important to me is the test cases.  If you don’t get around to it, I’ll try to write up a test case each for these patches.

Posted by Sam Ruby at

Done

Posted by Sam Ruby at

Hmm, I guess you are running this one live at the moment? I’m still stuck in the dark ages running Venus, and I see a difference. I edited a spelling mistake I happened to notice in one of my old posts, and that caused the post to reappear at [link]. On the other hand, it didn’t reappear at [link]. I think the latter is the correct behaviour.

Posted by Ciaran at

planet.intertwingly.net is produced using Venus.  Your feed has:

  <updated>2008-07-10T22:30:05Z</updated>

Which, per the spec, indicates the most recent instant in time when an entry or feed was modified in a way the publisher considers significant.

Your planet is subscribed to your RSS 2.0 feed, where the correct behavior in this matter is undefined.

Posted by Sam Ruby at

Right. So the difference is that I’m subscribed to the rss and you’re subscribed to the atom. Thanks for taking the time to point that out.

It seems to me that for a Wordpress user, “a way the publisher considers significant” means “always”. I guess I could patch my Wordpress to make it let me control that at edit time.

Posted by Ciaran at

Idea 272, Ticket 5196

Posted by Sam Ruby at

You’re a star, thanks again. Shouldn’t you be asleep over there at this time? And how did you (according to my display) post your reply four minutes before my question?

a) Posted by Ciaran at 07:49:17
b) Posted by Sam Ruby at 07:45

And, why is that particular time the only one out of all the ones on this page that doesn’t have seconds? No need to answer any of this - I’ll find the answers to my own stupid questions this time.

Posted by Ciaran at

Shouldn’t you be asleep over there at this time?

I tend to be an early riser.  This morning, more so than most, apparently.

why is that particular time the only one out of all the ones on this page that doesn’t have seconds?

I’ve seen that before, intermittently; but have not been able to track it down.  Dates in the page are initially in GMT, and are converted on the client side to local time.

Posted by Sam Ruby at

Ok, after much cursing I see what’s happening. When you’re iterating through the ‘times’ array in localizeDates() you sometimes adjust the headers to correspond to the local time. When this happens, it causes the array to change - in the case above, the header adjustment happens after my comment “Right. So the difference....”, and after the adjustment, an element has been removed from the array. The loop iterator, i, however, is unchanged, so the net result is that your comment starting “Idea 272...” gets skipped and the loop carries on from the following comment.

Hope that makes sense.

Posted by Ciaran at

OK, I added two lines which will copy the live NodeList to a static Array before iterating over it.  You may need to refresh before you see the effect.

Thanks!

Posted by Sam Ruby at

That’s got it, cheers.

Posted by Ciaran at
